Loan Default Prediction¶

Problem Definition¶

The Context:¶

The banking industry heavily relies on the profitability of home loans, which are primarily sought after by individuals with regular or substantial incomes. However, the specter of loan defaults presents a significant financial risk, potentially negating the profits from these loans. Historically, banks have navigated the loan approval process through meticulous manual reviews, a method that, while thorough, is prone to inefficiencies, human error, and biases. With the evolution of technology, there's been a shift towards automating this process to enhance efficiency and objectivity. The advent of data science and machine learning offers a promising avenue for developing sophisticated models that can predict loan default risks more accurately, thereby making the loan approval process more streamlined, unbiased, and effective.

The objective:¶

The primary aim is to revolutionize the loan approval process by implementing a classification model that predicts the likelihood of loan defaults. This initiative, rooted in the principles of the Equal Credit Opportunity Act, seeks to harness recent loan application data and insights from the bank's loan underwriting process to construct a model grounded in empirical data and statistical validity. The model is expected not only to enhance predictive accuracy but also to ensure transparency and fairness, particularly in rejection cases. By identifying key predictive factors, the model will provide actionable insights, enabling the bank to make more informed decisions, optimize the approval process, and ultimately minimize the risk of defaults.

The key questions:¶

What are the primary factors contributing to loan defaults?

Identifying the most influential variables that predict default can help in tailoring the model to focus on the most relevant data points. To answer this, we will perform exploratory data analysis (EDA) and model the relationships between the features and the target.

How can the model incorporate the guidelines of the Equal Credit Opportunity Act to ensure fairness and avoid bias?

To ensure the predictive model for loan approvals aligns with the Equal Credit Opportunity Act and utilizes the available data effectively, it's essential to focus on non-discriminatory, financially relevant features like income, debt levels, payment history, and assets, while excluding or carefully scrutinizing variables that could indirectly relate to protected characteristics. Incorporating bias detection and mitigation strategies throughout the model's development and application phases is crucial, employing statistical analysis to identify biases and applying algorithms to reduce them. Emphasizing model interpretability and transparency allows for the provision of clear, understandable justifications for credit decisions, meeting ECOA requirements. Regular validation, continuous monitoring for fairness, and legal compliance reviews ensure the model remains unbiased and effective over time. Feedback mechanisms further refine the model, ensuring it reflects equitable credit decision practices while leveraging data points like "DEBTINC," "CLAGE," "DELINQ," and "DEROG" to assess creditworthiness comprehensively and fairly.
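As one concrete, hedged illustration of such a bias check, the sketch below computes the disparate-impact ("four-fifths rule") ratio on a hypothetical `applications` table. The protected attribute here is used only for auditing model outcomes, never as a model input, and all names and numbers are invented for illustration:

```python
import pandas as pd

# Hypothetical audit table: 'approved' is the model's decision; 'group' is a
# protected attribute held OUT of the model and used only for this audit.
applications = pd.DataFrame({
    "group":    ["A", "A", "A", "A", "B", "B", "B", "B"],
    "approved": [1,   1,   1,   0,   1,   0,   0,   1],
})

# Approval rate per group, then the disparate-impact ratio; a ratio below
# 0.8 (the "four-fifths rule") is a common red flag warranting review.
rates = applications.groupby("group")["approved"].mean()
impact_ratio = rates.min() / rates.max()
print(rates.to_dict())         # {'A': 0.75, 'B': 0.5}
print(round(impact_ratio, 3))  # 0.667
```

A check like this would run during validation and monitoring, alongside the statistical bias analyses described above.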

What predictive modeling techniques will be most effective and interpretable for this application?

Determining the balance between model complexity, accuracy, and interpretability to ensure that decisions can be explained and justified.

Logistic Regression

Pros:

  • Highly interpretable: The influence of each predictor on the outcome is quantified by coefficients, making it straightforward to understand and explain.
  • Simplicity: It’s easy to implement and efficient to train, ideal for baseline models.

Cons:

  • Assumes Linearity: It assumes a linear relationship between the independent variables and the log odds of the dependent variable, which might not always hold true.
  • Limited Complexity: Might not capture complex nonlinear relationships as effectively as tree-based methods.

Decision Trees

Pros:

  • Transparent Decision Process: The hierarchical structure of decisions based on feature values is easy to visualize and understand.

  • Handles Non-linearity: Can model complex relationships without needing the data to be linearly separable.

Cons:

  • Overfitting: Tends to overfit the training data, making the model less generalizable to unseen data.

  • Instability: Small changes in the data can result in significantly different trees.

Random Forest

Pros:

  • Improved Accuracy: By averaging multiple decision trees, it reduces the variance and avoids the overfitting issue of individual trees.
  • Feature Importance: Provides insights into which variables are most influential in predicting the outcome.

Cons:

  • Less Interpretability: While still more interpretable than more complex models, the ensemble nature of random forests makes them less transparent than single decision trees.
  • Computationally Intensive: Requires more computational resources and time to train and predict compared to simpler models.
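A minimal sketch of how these three candidates can be compared on an equal footing, here on a synthetic, imbalanced stand-in for the loan data (the real comparison on the HMEQ features comes later in the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with roughly the 80/20 class balance seen in HMEQ.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.8, 0.2], random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Five-fold cross-validated F1 gives a like-for-like comparison that respects
# the class imbalance better than plain accuracy.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```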

How can the bank use the model's insights to optimize its loan approval process?

Translating model predictions into actionable strategies for assessing loan applications can enhance decision-making efficiency and accuracy. This question is addressed in the conclusions at the end of the analysis.

What measures will be taken to validate the model's predictions and assess its performance over time?

Establishing criteria for model evaluation and continuous improvement ensures the model remains accurate and relevant as new data becomes available. During training we can use cross-validation to assess the model's performance across different subsets of the data, which helps estimate its ability to generalize to unseen data, and a confusion matrix to evaluate precision, recall, accuracy, and F1 score. The latter is particularly important for classification models, where we need to understand the trade-offs between different types of errors (false positives and false negatives).
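A minimal sketch of both measures on a held-out split (synthetic data here; in the loan context a false negative, i.e. an applicant predicted to repay who then defaults, is typically the costlier error):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# With labels 0/1, ravel() yields TN, FP, FN, TP in that order; the FN cell
# (predicted repay, actually defaulted) is the costliest error for the bank.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(classification_report(y_test, y_pred, digits=3))
```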

How will adverse actions (loan rejections) be communicated to ensure transparency and provide justification based on the model’s findings?

The model plays a crucial role in assisting the bank to address the issues of transparency and fairness in communicating adverse actions, like loan rejections, by providing clear, interpretable insights into the decision-making process. By leveraging interpretable modeling techniques and incorporating explanation frameworks, the model can identify and communicate the specific reasons contributing to a loan rejection, such as high debt-to-income ratios or insufficient credit history, in a manner that is understandable to applicants.
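As a hedged sketch of how such reason codes could be derived, the toy logistic model below ranks each feature's contribution to the default log-odds for one applicant. The feature names are borrowed from the dataset, but the data and relationships are invented for illustration; on standardized inputs, coefficient × value gives a comparable per-feature contribution:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

feature_names = ["DEBTINC", "DELINQ", "CLAGE"]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # already standardized, so coefficients compare
# Toy target in which high DEBTINC and DELINQ push toward default.
y = (X[:, 0] + 0.5 * X[:, 1] - 0.3 * X[:, 2]
     + rng.normal(scale=0.5, size=200) > 0.5).astype(int)

clf = LogisticRegression().fit(X, y)

applicant = X[0]
contributions = clf.coef_[0] * applicant  # per-feature log-odds contribution
order = np.argsort(contributions)[::-1]   # strongest push toward default first
reasons = [feature_names[i] for i in order if contributions[i] > 0]
print("Principal reason codes:", reasons[:2])
```

The top-ranked positive contributions can then be translated into plain-language adverse-action notices (e.g., "debt-to-income ratio too high").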

The problem formulation:¶

The overarching goal is to streamline and improve the loan approval process, making it more efficient, fair, and free from biases. By developing a classification model based on empirical data and statistical analysis, we seek to:

  • Enhance Decision-Making Accuracy: Automate the prediction of loan default risk with a high degree of accuracy, allowing the bank to make informed lending decisions.

  • Ensure Fairness and Compliance: Adhere to the Equal Credit Opportunity Act's guidelines, ensuring that the model's decisions are devoid of biases that could unfairly affect certain groups of applicants.

  • Improve Efficiency: Reduce the time and resources currently required for manual loan approval processes, thereby increasing operational efficiency.

  • Maintain Transparency: Build a model that is not only predictive but also interpretable, enabling the bank to provide clear justifications for loan approvals or rejections, thus maintaining transparency with applicants.

  • Identify Key Predictive Features: Determine the most significant factors that predict loan defaults, offering insights that can guide the bank's policies and strategies regarding loan approvals.

Data Description:¶

The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant has ultimately defaulted or has been severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). 12 input variables were registered for each applicant.

  • BAD: 1 = Client defaulted on loan, 0 = loan repaid

  • LOAN: Amount of loan approved.

  • MORTDUE: Amount due on the existing mortgage.

  • VALUE: Current value of the property.

  • REASON: Reason for the loan request. (HomeImp = home improvement, DebtCon= debt consolidation which means taking out a new loan to pay off other liabilities and consumer debts)

  • JOB: The type of job the loan applicant has, such as manager, self-employed, etc.

  • YOJ: Years at present job.

  • DEROG: Number of major derogatory reports (which indicates a serious delinquency or late payments).

  • DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).

  • CLAGE: Age of the oldest credit line in months.

  • NINQ: Number of recent credit inquiries.

  • CLNO: Number of existing credit lines.

  • DEBTINC: Debt-to-income ratio (all your monthly debt payments divided by your gross monthly income. This number is one way lenders measure your ability to manage the monthly payments to repay the money you plan to borrow.)
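A worked example of the DEBTINC calculation, using hypothetical numbers and expressed as a percentage to match the dataset's scale:

```python
# Hypothetical applicant: mortgage + car loan + credit cards vs. gross income.
monthly_debt_payments = 1500 + 300 + 200
gross_monthly_income = 6000

debtinc = 100 * monthly_debt_payments / gross_monthly_income
print(f"DEBTINC = {debtinc:.1f}")  # 33.3, near the dataset's mean of ~33.8
```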

Import the necessary libraries and Data¶

In [ ]:
import pandas as pd
import numpy as np
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from scipy.stats import chi2_contingency, ttest_ind
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, mean_squared_error
from sklearn.ensemble import RandomForestClassifier

from scipy.stats.mstats import winsorize
In [ ]:
from google.colab import drive
drive.mount('/content/drive/')
Mounted at /content/drive/
In [ ]:
data = pd.read_csv('/content/drive/My Drive/Colab_Notebooks/hmeq.csv')
data.shape
Out[ ]:
(5960, 13)

Data Overview¶

  • Reading the dataset
  • Understanding the shape of the dataset
  • Checking the data types
  • Checking for missing values
  • Checking for duplicated values
In [ ]:
# get the first 5 rows of the data
data.head()
Out[ ]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
0 1 1100 25860.0 39025.0 HomeImp Other 10.5 0.0 0.0 94.366667 1.0 9.0 NaN
1 1 1300 70053.0 68400.0 HomeImp Other 7.0 0.0 2.0 121.833333 0.0 14.0 NaN
2 1 1500 13500.0 16700.0 HomeImp Other 4.0 0.0 0.0 149.466667 1.0 10.0 NaN
3 1 1500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0 1700 97800.0 112000.0 HomeImp Office 3.0 0.0 0.0 93.333333 0.0 14.0 NaN
In [ ]:
# get the last 5 rows of the data
data.tail()
Out[ ]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
5955 0 88900 57264.0 90185.0 DebtCon Other 16.0 0.0 0.0 221.808718 0.0 16.0 36.112347
5956 0 89000 54576.0 92937.0 DebtCon Other 16.0 0.0 0.0 208.692070 0.0 15.0 35.859971
5957 0 89200 54045.0 92924.0 DebtCon Other 15.0 0.0 0.0 212.279697 0.0 15.0 35.556590
5958 0 89800 50370.0 91861.0 DebtCon Other 14.0 0.0 0.0 213.892709 0.0 16.0 34.340882
5959 0 89900 48811.0 88934.0 DebtCon Other 15.0 0.0 0.0 219.601002 0.0 16.0 34.571519
In [ ]:
# get data datatypes and non-nulls
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   BAD      5960 non-null   int64  
 1   LOAN     5960 non-null   int64  
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object 
 5   JOB      5681 non-null   object 
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB
In [ ]:
duplicates = data.duplicated()
duplicates
Out[ ]:
0       False
1       False
2       False
3       False
4       False
        ...  
5955    False
5956    False
5957    False
5958    False
5959    False
Length: 5960, dtype: bool

There are no duplicates in the dataset

In [ ]:
#knowing the percentage of null data
total_nulls = data.isnull().sum().sum()
print(f"Total null values in DataFrame: {total_nulls}")
null_percentage = data.isnull().mean() * 100
print("Null values percentage per column:")
print(null_percentage)
Total null values in DataFrame: 5271
Null values percentage per column:
BAD         0.000000
LOAN        0.000000
MORTDUE     8.691275
VALUE       1.879195
REASON      4.228188
JOB         4.681208
YOJ         8.640940
DEROG      11.879195
DELINQ      9.731544
CLAGE       5.167785
NINQ        8.557047
CLNO        3.724832
DEBTINC    21.258389
dtype: float64
In [ ]:
msno.matrix(data)
plt.show()

Our dataset contains missing values across various columns, with the proportion of missing data ranging from approximately 1.88% to 21.26% across a total of 5,960 rows. This situation requires us to formulate certain assumptions about the nature and impact of these missing values.

Given the substantial presence of missing data, it's crucial to incorporate this consideration into our exploratory data analysis (EDA). We aim to understand the pattern of missingness, determining whether the data are missing at random (MAR), missing completely at random (MCAR), or missing not at random (MNAR). Additionally, we need to examine the inter-variable relationships of the missing data to prevent the introduction of biases. This careful approach ensures a more accurate analysis and interpretation of our dataset.

Summary Statistics¶

In [ ]:
data.describe()
Out[ ]:
BAD LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
count 5960.000000 5960.000000 5442.000000 5848.000000 5445.000000 5252.000000 5380.000000 5652.000000 5450.000000 5738.000000 4693.000000
mean 0.199497 18607.969799 73760.817200 101776.048741 8.922268 0.254570 0.449442 179.766275 1.186055 21.296096 33.779915
std 0.399656 11207.480417 44457.609458 57385.775334 7.573982 0.846047 1.127266 85.810092 1.728675 10.138933 8.601746
min 0.000000 1100.000000 2063.000000 8000.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.524499
25% 0.000000 11100.000000 46276.000000 66075.500000 3.000000 0.000000 0.000000 115.116702 0.000000 15.000000 29.140031
50% 0.000000 16300.000000 65019.000000 89235.500000 7.000000 0.000000 0.000000 173.466667 1.000000 20.000000 34.818262
75% 0.000000 23300.000000 91488.000000 119824.250000 13.000000 0.000000 0.000000 231.562278 2.000000 26.000000 39.003141
max 1.000000 89900.000000 399550.000000 855909.000000 41.000000 10.000000 15.000000 1168.233561 17.000000 71.000000 203.312149

Exploratory Data Analysis (EDA) and Visualization¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Leading Questions:

  1. What is the range of values for the loan amount variable "LOAN"?
  2. How does the distribution of years at present job "YOJ" vary across the dataset?
  3. How many unique categories are there in the REASON variable?
  4. What is the most common category in the JOB variable?
  5. Is there a relationship between the REASON variable and the proportion of applicants who defaulted on their loan?
  6. Do applicants who default have a significantly different loan amount compared to those who repay their loan?
  7. Is there a correlation between the value of the property and the loan default rate?
  8. Do applicants who default have a significantly different mortgage amount compared to those who repay their loan?

Preliminary Intuition behind the data

These are raw intuitions formed before examining the data, but they can guide decisions on how to handle outliers and missing values, and suggest how each variable might influence defaults.

BAD (Loan Default Indicator): This binary variable is the target outcome, indicating whether a loan was repaid or defaulted. It directly reflects the risk the bank is trying to mitigate.

LOAN (Amount of Loan Approved): Larger loan amounts may correlate with a higher risk of default due to the increased financial burden on the borrower.

MORTDUE (Amount Owed on Mortgage): A higher mortgage due could indicate financial strain, potentially increasing the likelihood of default, especially if it significantly outweighs the borrower's assets or income.

VALUE (Current Value of Property): Properties with higher values may indicate borrowers with more assets, possibly correlating with a lower default risk. However, market fluctuations can affect property values and, subsequently, this relationship.

REASON (Reason for Loan): Typically categorized as "HomeImp" for home improvement or "DebtCon" for debt consolidation, this can provide valuable context for assessing default risk. Loans taken out for home improvement ("HomeImp") might suggest an investment in the property's value, potentially indicating financial stability and planning. Conversely, loans for debt consolidation ("DebtCon") could indicate attempts to manage existing financial strain or overextension, which might carry a different risk profile.

JOB (Type of Job): The job type can provide insights into income stability and levels. Certain professions might inherently carry more stable income prospects, affecting default risk.

YOJ (Years at Present Job): Longer employment duration may suggest job stability, which could correlate with a lower risk of default due to consistent income.

DEROG (Number of Major Derogatory Reports): A higher number of derogatory marks on a borrower's credit report can be a strong predictor of default, reflecting past difficulties in managing credit.

DELINQ (Number of Delinquent Credit Lines): Similar to DEROG, a higher number of delinquent accounts may indicate trouble managing debt obligations, potentially predicting future default risk.

CLAGE (Age of Oldest Credit Line in Months): Older credit lines might imply a longer credit history and, possibly, more financial experience and stability, which could correlate with a lower risk of default.

NINQ (Number of Recent Credit Inquiries): A high number of recent inquiries could suggest financial distress or overextension, potentially increasing default risk.

CLNO (Number of Existing Credit Lines): This could be a double-edged sword; more credit lines might indicate creditworthiness and financial management skills but could also suggest potential overextension.

DEBTINC (Debt-to-Income Ratio): A higher ratio might indicate that a significant portion of the borrower's income is dedicated to debt repayment, potentially increasing the risk of default due to limited financial flexibility.

Univariate Analysis¶

To begin our analysis, let's first review the current state of our dataset. It has been observed that there are missing values present across various variables. Given the significant proportion of these missing values, simply discarding observations with missing data is not an optimal approach, as it could introduce bias and potentially distort the underlying relationships between variables. Therefore, our next step will be to methodically examine each variable to understand both its distribution and the extent of its missing values. This approach will enable us to devise more informed strategies for handling these missing values effectively.

In [ ]:
## Numerical data
numerical_columns = ["LOAN","MORTDUE", "VALUE", "YOJ", "DEROG", "DELINQ", "CLAGE", "NINQ","CLNO", "DEBTINC"]
## Categorical data
categorical_columns = ["JOB", "REASON"]
In [ ]:
def plot_distribution_and_boxplot(dataset, column_name):
    # Calculate the percentage of missing data
    missing_percentage = dataset[column_name].isnull().mean() * 100

    # Creating the subplot structure
    fig, axs = plt.subplots(2, 1, figsize=(10, 8), gridspec_kw={'height_ratios': [3, 1], 'hspace': 0.5})

    # Histogram with Density Plot on the first subplot
    sns.histplot(dataset[column_name].dropna(), kde=True, bins=30, color='skyblue', ax=axs[0])
    axs[0].axvline(dataset[column_name].mean(), color='red', linestyle='--', linewidth=2)
    axs[0].set_title(f'{column_name} Distribution - Missing Data: {missing_percentage:.2f}%')
    axs[0].set_xlabel(column_name)
    axs[0].set_ylabel('Frequency')

    # Boxplot on the second subplot
    sns.boxplot(x=dataset[column_name], color='lightblue', ax=axs[1], showmeans=True)
    axs[1].set_title(f'{column_name} Boxplot')
    axs[1].set_xlabel(column_name)
    axs[1].set_ylabel('')

    plt.show()

for column in numerical_columns:
    plot_distribution_and_boxplot(data, column)
In [ ]:
def plot_categorical_distribution(data, column_name):
    # Calculate counts and percentages
    counts = data[column_name].value_counts(dropna=False)  # Include NaN values in the count
    total = data.shape[0]  # Total number of rows to consider NaN in calculation
    percentages = 100 * counts / total

    # Calculate missing data percentage
    missing_percentage = 100 * data[column_name].isnull().sum() / total

    # Function for formatting autopct
    def autopct_format():
        def my_format(pct):
            total_count = int(round(pct*total/100.0))
            # Adjust to consider NaN if included in counts with dropna=False
            return '{:.1f}%\n({:d})'.format(pct, total_count)
        return my_format  # Return the function itself

    # Plot
    plt.figure(figsize=(8, 8))
    plt.pie(counts, labels=counts.index, autopct=autopct_format(), startangle=140)
    plt.title(f'{column_name} Distribution - Missing Data: {missing_percentage:.2f}%')
    plt.show()


for column in ["BAD"] + categorical_columns:
    plot_categorical_distribution(data, column)

Treating missing values¶

These assumptions can help guide strategies for handling the missing data and understanding its potential impact on predictive modeling efforts:

Random vs. Non-Random Missing Data:

Assumption: The missingness in data like "MORTDUE" (mortgage due), "DEBTINC" (debt-to-income ratio), and others might not be entirely random. For instance, missing "DEBTINC" values could be more common in applicants with complex income sources that are harder to document, or missing "JOB" information could be linked to self-employed applicants who might categorize their employment differently.

Implication: If missingness is non-random, simply ignoring or removing these cases could introduce bias or affect the model's accuracy. Analyzing the pattern of missing data can offer insights into its nature and guide appropriate imputation strategies.

Missingness Related to Applicant Characteristics:

Assumption: Missing values in "DEROG" (derogatory reports) and "DELINQ" (delinquencies) might relate to applicants with newer credit histories who haven’t encountered situations leading to derogatory remarks or delinquencies, hence the lack of recorded incidents.

Implication: This assumption suggests that for some variables, the absence of data could itself be informative, potentially indicating lower risk profiles for certain applicants.
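A minimal sketch of how this assumption could be operationalized on a hypothetical fragment of the credit-history columns: an explicit flag preserves the "absence is informative" signal even after the gap is filled.

```python
import numpy as np
import pandas as pd

# Hypothetical fragment of the credit-history columns.
df = pd.DataFrame({"DEROG":  [0.0, 2.0, np.nan, 1.0, np.nan],
                   "DELINQ": [np.nan, 1.0, 0.0, np.nan, 0.0]})

for col in ["DEROG", "DELINQ"]:
    # Keep the missingness itself as an explicit model feature...
    df[f"{col}_missing"] = df[col].isnull().astype(int)
    # ...then fill under the assumption that no recorded incidents ~ zero incidents.
    df[col] = df[col].fillna(0)

print(df)
```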

Impact of Missing Data on Predictive Power:

Assumption: Columns with higher percentages of missing data, such as "DEBTINC" with over 21% missing, could have a significant impact on the model’s ability to accurately predict loan defaults if the missing data is not adequately addressed.

Implication: High levels of missing data in key predictive variables necessitate careful consideration of imputation methods to preserve or enhance the model’s predictive accuracy.

Correlation Between Missingness and Other Variables:

Assumption: The occurrence of missing data in one variable may be related to the presence or absence of data in another. For example, missing "VALUE" data might coincide with missing "MORTDUE" information, possibly because both are related to the applicant's property.

Implication: Understanding correlations between missing data across variables can inform multivariate imputation techniques, which consider these relationships to fill in missing values more accurately.
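A minimal sketch of such a multivariate approach using `KNNImputer` (imported at the top of the notebook) on a hypothetical MORTDUE/VALUE fragment, where each gap is filled from the most similar complete rows:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical correlated pair: VALUE sometimes missing alongside MORTDUE.
df = pd.DataFrame({"MORTDUE": [25000.0, np.nan, 70000.0, 65000.0, np.nan],
                   "VALUE":   [39000.0, 112000.0, np.nan, 89000.0, 91000.0]})

# KNNImputer measures row similarity on the observed columns (nan-euclidean
# distance), so the MORTDUE/VALUE relationship informs both imputations.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```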

Missing Data as a Separate Category:

Assumption: For categorical variables like "JOB" and "REASON", the missingness could be treated as a separate category during analysis, under the assumption that not providing this information may itself be indicative of certain borrower behaviors or characteristics.

Implication: This approach can preserve the data's structure and provide additional insights into how the absence of certain information relates to loan default risk.
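A minimal sketch of this treatment for the two categorical columns (hypothetical values; the "Missing" label becomes its own level for later one-hot encoding):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"JOB":    ["Other", np.nan, "Office", "Mgr", np.nan],
                   "REASON": ["HomeImp", "DebtCon", np.nan, "DebtCon", "HomeImp"]})

# Treat the absence of an answer as a category rather than guessing a value.
for col in ["JOB", "REASON"]:
    df[col] = df[col].fillna("Missing")

print(df["JOB"].value_counts().to_dict())
```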

Let's address the missing data first; that will allow us to better comprehend the relationships between features, and we can begin the univariate analysis at the same time.

In [ ]:
msno.heatmap(data)
Out[ ]:
<Axes: >
In [ ]:
# Visualize missingness using a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(data.isnull(), cbar=False, cmap='viridis')
plt.title('Data Heatmap - Missing Values')
plt.xlabel('Variables')
plt.ylabel('Observations')
plt.show()

# Visualize missingness using a dendrogram
plt.figure(figsize=(10, 6))
msno.dendrogram(data)
plt.title('Dendrogram - Missing Values')
plt.show()
<Figure size 1000x600 with 0 Axes>

Understanding the patterns of missing data, including Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR), is essential for robust data analysis and modeling. MCAR means the probability of being missing is unrelated to any observed or unobserved data; MAR means missingness depends only on other observed variables; MNAR means missingness depends on the unobserved values themselves.

This will help us choose the best imputation strategy for each feature.

Let's test each feature to characterize the behavior of its missing-data pattern.

In [ ]:
## perform MCAR test

# Function to perform Chi-Square Test for categorical variables
def chi_square_test(data, categorical_vars):
    for var in categorical_vars:
        crosstab = pd.crosstab(data['missing_indicator'], data[var], dropna=False)
        chi2, p, dof, expected = chi2_contingency(crosstab)
        print(f"Chi-Square Test p-value for '{var}': {p}")

# Function to perform T-Test for numerical variables
def t_test(data, numerical_vars):
    for var in numerical_vars:
        group_missing = data[data['missing_indicator'] == 1][var]
        group_not_missing = data[data['missing_indicator'] == 0][var]
        t_stat, p_val = ttest_ind(group_missing, group_not_missing, nan_policy='omit')
        print(f"T-Test p-value for '{var}': {p_val}")

def analyze_missingness(data, target_vars, numerical_vars, categorical_vars):
    for target_var in target_vars:
        print(f"Analyzing missingness for: {target_var}")
        # Creating missing indicator for the target variable
        data['missing_indicator'] = data[target_var].isnull().astype(int)

        # Perform Chi-Square Test for categorical variables
        chi_square_test(data, categorical_vars)

        # Perform T-Test for numerical variables
        t_test(data, numerical_vars)

        # Cleanup by removing the missing indicator to prepare for the next iteration
        data.drop('missing_indicator', axis=1, inplace=True)

# List of variables to analyze for missingness
target_vars = ['DEROG', 'DELINQ', 'NINQ', 'MORTDUE', 'YOJ']

analyze_missingness(data, target_vars, numerical_columns, categorical_columns)
Analyzing missingness for: DEROG
Chi-Square Test p-value for 'JOB': 1.5990906100488e-08
Chi-Square Test p-value for 'REASON': 0.027062226903309197
T-Test p-value for 'LOAN': 6.473199772544665e-10
T-Test p-value for 'MORTDUE': 0.05800547283711083
T-Test p-value for 'VALUE': 0.00431179403233319
T-Test p-value for 'YOJ': 0.027449058255107944
T-Test p-value for 'DEROG': nan
T-Test p-value for 'DELINQ': 5.044171424812875e-63
T-Test p-value for 'CLAGE': 0.007609702846965379
T-Test p-value for 'NINQ': 4.396373330994414e-17
T-Test p-value for 'CLNO': 1.7628567568539897e-09
T-Test p-value for 'DEBTINC': 0.7741292421448079
Analyzing missingness for: DELINQ
Chi-Square Test p-value for 'JOB': 5.109007838191516e-08
Chi-Square Test p-value for 'REASON': 0.9646562191000062
T-Test p-value for 'LOAN': 0.0029638658137338967
T-Test p-value for 'MORTDUE': 0.28484137204793886
T-Test p-value for 'VALUE': 5.491012738618064e-10
T-Test p-value for 'YOJ': 3.3645166413844434e-12
T-Test p-value for 'DEROG': 4.624383627816484e-35
T-Test p-value for 'DELINQ': nan
T-Test p-value for 'CLAGE': 0.013929240237087113
T-Test p-value for 'NINQ': 4.733497983306184e-14
T-Test p-value for 'CLNO': 0.2537841603573451
T-Test p-value for 'DEBTINC': 0.0005287782652992446
Analyzing missingness for: NINQ
Chi-Square Test p-value for 'JOB': 2.8732129247173023e-09
Chi-Square Test p-value for 'REASON': 0.0019763944090644683
T-Test p-value for 'LOAN': 1.7552912380770204e-07
T-Test p-value for 'MORTDUE': 0.10418233931265687
T-Test p-value for 'VALUE': 1.1583923297513604e-06
T-Test p-value for 'YOJ': 1.2209486897259112e-06
T-Test p-value for 'DEROG': 1.6037639234345464e-24
T-Test p-value for 'DELINQ': 7.70451409694177e-08
T-Test p-value for 'CLAGE': 0.4045675477500539
T-Test p-value for 'NINQ': nan
T-Test p-value for 'CLNO': 0.2959530795256866
T-Test p-value for 'DEBTINC': 5.228354749733804e-15
Analyzing missingness for: MORTDUE
Chi-Square Test p-value for 'JOB': 2.3821585508420545e-32
Chi-Square Test p-value for 'REASON': 2.075224022308458e-33
T-Test p-value for 'LOAN': 0.391394413953697
T-Test p-value for 'MORTDUE': nan
T-Test p-value for 'VALUE': 2.030110648485794e-46
T-Test p-value for 'YOJ': 0.33041270542226064
T-Test p-value for 'DEROG': 0.006338132538042123
T-Test p-value for 'DELINQ': 0.06293341574546857
T-Test p-value for 'CLAGE': 0.06302707121424661
T-Test p-value for 'NINQ': 0.0011816833564071351
T-Test p-value for 'CLNO': 2.3732824460988805e-77
T-Test p-value for 'DEBTINC': 6.456704925116695e-43
Analyzing missingness for: YOJ
Chi-Square Test p-value for 'JOB': 2.1408200333960915e-42
Chi-Square Test p-value for 'REASON': 0.0710290417732955
T-Test p-value for 'LOAN': 0.0002735686785907759
T-Test p-value for 'MORTDUE': 8.88376354789684e-08
T-Test p-value for 'VALUE': 1.0935616110950211e-11
T-Test p-value for 'YOJ': nan
T-Test p-value for 'DEROG': 8.826499217857786e-05
T-Test p-value for 'DELINQ': 0.002919192113766771
T-Test p-value for 'CLAGE': 1.6995749956210638e-10
T-Test p-value for 'NINQ': 0.0024950866646551127
T-Test p-value for 'CLNO': 2.692602511931492e-18
T-Test p-value for 'DEBTINC': 0.09096551714872332
In [ ]:
# DEROG DELINQ CLINQ


# Identifying missing values for NINQ, DELINQ, and DEROG
missing_ninq = data['NINQ'].isnull()
missing_delinq = data['DELINQ'].isnull()
missing_derog = data['DEROG'].isnull()

# Combine missing conditions
missing_any = missing_ninq | missing_delinq | missing_derog
missing_all = missing_ninq & missing_delinq & missing_derog

# Descriptive statistics for CLAGE where any or all of the NINQ, DELINQ, DEROG are missing
print("CLAGE where any of NINQ, DELINQ, DEROG are missing:")
print(data.loc[missing_any, 'CLAGE'].describe())

print("\nCLAGE where all of NINQ, DELINQ, DEROG are missing:")
print(data.loc[missing_all, 'CLAGE'].describe())

# Descriptive statistics for CLNO where any or all of the NINQ, DELINQ, DEROG are missing
print("CLNO where any of NINQ, DELINQ, DEROG are missing:")
print(data.loc[missing_any, 'CLNO'].describe())

print("\nCLNO where all of NINQ, DELINQ, DEROG are missing:")
print(data.loc[missing_all, 'CLNO'].describe())

print("CLAGE and CLNO where NINQ, DELINQ, DEROG data are not missing:")
print(data.loc[~missing_any, ['CLAGE', 'CLNO']].describe())
CLAGE where any of NINQ, DELINQ, DEROG are missing:
count    615.000000
mean     181.681224
std       97.253324
min        0.507115
25%      109.590773
50%      155.143161
75%      253.116902
max      485.945358
Name: CLAGE, dtype: float64

CLAGE where all of NINQ, DELINQ, DEROG are missing:
count    149.000000
mean     175.009541
std       79.750657
min       70.253395
25%      115.312749
50%      159.801493
75%      184.296687
max      354.735919
Name: CLAGE, dtype: float64
CLNO where any of NINQ, DELINQ, DEROG are missing:
count    615.000000
mean      22.452033
std       11.774921
min        4.000000
25%       13.000000
50%       20.000000
75%       29.000000
max       56.000000
Name: CLNO, dtype: float64

CLNO where all of NINQ, DELINQ, DEROG are missing:
count    149.000000
mean      20.194631
std        8.782413
min        8.000000
25%       13.000000
50%       19.000000
75%       24.000000
max       39.000000
Name: CLNO, dtype: float64
CLAGE and CLNO where NINQ, DELINQ, DEROG data are not missing:
             CLAGE         CLNO
count  5037.000000  5123.000000
mean    179.532467    21.157330
std      84.314438     9.916689
min       0.000000     0.000000
25%     116.614859    15.000000
50%     174.506408    20.000000
75%     230.242235    26.000000
max    1168.233561    71.000000

General Observations:

  • Significant p-values (<0.05) indicate a statistical relationship between the missingness of the target variable and the tested variable: observing the data under the null hypothesis (no association between the missingness and the tested variable) would be extremely unlikely.
  • Non-significant p-values (≥0.05) suggest there is not enough evidence to conclude such a relationship.
  • A nan T-Test p-value typically occurs when a variable is tested against its own missingness, or when a group has no variance (e.g., all values are identical or missing).

"DEROG" Missingness: Strong associations are observed with "JOB", "REASON", "LOAN", "VALUE", "YOJ", "DELINQ", "CLAGE", "NINQ", and "CLNO". This widespread correlation suggests that the missingness in "DEROG" might be systematically related to both the applicant's job and reason for the loan, financial factors (loan amount, property value, years on the job), and other aspects of their credit history (delinquencies, credit inquiries, number of credit lines). The missingness here could be influenced by applicants' characteristics or might indicate a pattern where applicants with certain profiles are more likely to have or to omit this information.

"DELINQ" Missingness: The missingness shows strong associations with "JOB", "VALUE", "YOJ", "DEROG", "NINQ", and "DEBTINC". Similar to "DEROG", the pattern suggests a relationship between missingness and both employment-related factors and detailed financial variables, indicating possible profile similarities among those missing this data.

"NINQ" Missingness: Significant correlations with "JOB", "REASON", "LOAN", "VALUE", "YOJ", "DEROG", "DELINQ", and "DEBTINC" highlight how missingness in credit inquiries is linked to a wide range of factors, possibly pointing towards either data collection issues or specific applicant characteristics that lead to this missingness.

"MORTDUE" Missingness: Very strong associations with "JOB", "REASON", "VALUE", "NINQ", "CLNO", and "DEBTINC" indicate that the missing mortgage due amounts are not random and are particularly linked to the job, the reason for the loan, property values, and debt-to-income ratios, suggesting a pattern in the types of applicants or the conditions under which this data tends to be missing.

"YOJ" Missingness: Missing "YOJ" data shows significant links to "JOB", "LOAN", "MORTDUE", "VALUE", "DEROG", "DELINQ", "CLAGE", "NINQ", and "CLNO", implying that the missingness could be related to the applicants' job and financial details, including their credit history and loan characteristics.
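The missingness summaries above were produced by looping over candidate columns: a chi-square test of the missingness indicator against each categorical variable, and a Welch t-test against each numerical one. A minimal sketch of that loop (the helper name and the tiny synthetic demo are hypothetical; the notebook's actual function may differ):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, ttest_ind

def analyze_missingness(df, target, categorical_cols, numerical_cols):
    """For one column with missing values, test whether its missingness is
    associated with each other variable. Returns {column: p-value}."""
    results = {}
    missing_mask = df[target].isnull()
    # Chi-square test: missingness indicator vs. each categorical variable
    for col in categorical_cols:
        contingency = pd.crosstab(missing_mask, df[col])
        _, p, _, _ = chi2_contingency(contingency)
        results[col] = p
    # Welch t-test: compare each numeric variable between missing / non-missing rows
    for col in numerical_cols:
        group_missing = df.loc[missing_mask, col].dropna()
        group_present = df.loc[~missing_mask, col].dropna()
        _, p = ttest_ind(group_missing, group_present, equal_var=False)
        results[col] = p
    return results

# Tiny synthetic demo (hypothetical data, not the loan dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'MORTDUE': np.where(rng.random(200) < 0.3, np.nan,
                        rng.normal(70000, 40000, 200)),
    'JOB': rng.choice(['Office', 'Other', 'Mgr'], 200),
    'VALUE': rng.normal(100000, 50000, 200),
})
pvals = analyze_missingness(demo, 'MORTDUE', ['JOB'], ['VALUE'])
```

Each p-value is then compared against 0.05 exactly as in the observations above.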

Summary of Imputation Techniques:¶

In our data analysis, missingness in "NINQ", "DELINQ", and "DEROG" doesn't significantly alter the mean values of "CLAGE" (Age of Credit Line) and "CLNO" (Number of Credit Lines), suggesting other factors might contribute to the missing data. Hence, we considered Median Imputation and KNN Imputation for handling missing values in our predictive modeling phase.

Decision Rationale:

  1. Median Imputation: Chosen for its simplicity, speed, and robustness to outliers, making it suitable as a baseline imputation method.
  2. KNN Imputation: Preferred due to its ability to preserve relationships between variables, crucial given the high correlations observed among certain variables in our dataset. We selected k=5 to balance capturing local patterns without overfitting.

Additionally, for "DEBTINC" we employed median imputation: its missingness appears random and its distribution is right-skewed, so the median gives a representative, outlier-robust fill value.

For the remaining numerical variables, the dataset's structure and the need to preserve inter-variable relationships led us to KNN imputation with k=5, balancing imputation accuracy against computational cost and model complexity.

In [ ]:
#keep a copy of the data
original_data = data.copy()
In [ ]:
# Initialize the KNNImputer
imputer = KNNImputer(n_neighbors=5, weights='uniform', metric='nan_euclidean')

# Select variables for imputation and relevant variables for KNN context
variables_to_impute = ['DEROG', 'DELINQ', 'NINQ', 'LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'CLAGE', 'CLNO']

# Prepare data for imputation
data_to_impute = data[variables_to_impute].copy()

# Apply a log transformation to skewed variables, adding 1 to handle zeros
for col in ['DEROG', 'DELINQ', 'NINQ']:
    data_to_impute[col] = np.log1p(data_to_impute[col])

# Perform the imputation
imputed_data_log_transformed = imputer.fit_transform(data_to_impute)

# Convert the imputed numpy array back to a DataFrame
imputed_df_log_transformed = pd.DataFrame(imputed_data_log_transformed, columns=variables_to_impute)

# Reverse the log transformation
for col in ['DEROG', 'DELINQ', 'NINQ']:
    imputed_df_log_transformed[col] = np.expm1(imputed_df_log_transformed[col])

# Update the original DataFrame with the imputed values
data.update(imputed_df_log_transformed)

# Verify the imputation
print(data[variables_to_impute].describe())
             DEROG       DELINQ         NINQ          LOAN        MORTDUE  \
count  5960.000000  5960.000000  5960.000000   5960.000000    5960.000000   
mean      0.258066     0.479808     1.152833  18607.969799   72849.947455   
std       0.800957     1.094570     1.665734  11207.480417   43153.211045   
min       0.000000     0.000000     0.000000   1100.000000    2063.000000   
25%       0.000000     0.000000     0.000000  11100.000000   46431.250000   
50%       0.000000     0.000000     1.000000  16300.000000   64373.400000   
75%       0.000000     0.643752     2.000000  23300.000000   89939.000000   
max      10.000000    15.000000    17.000000  89900.000000  399550.000000   

               VALUE          YOJ        CLAGE         CLNO  
count    5960.000000  5960.000000  5960.000000  5960.000000  
mean   101265.043217     8.907342   179.212184    21.225570  
std     57256.005044     7.371312    84.143410    10.005451  
min      8000.000000     0.000000     0.000000     0.000000  
25%     65683.500000     3.000000   116.761588    15.000000  
50%     88605.000000     7.000000   172.735145    20.000000  
75%    119229.000000    13.000000   228.041251    26.000000  
max    855909.000000    41.000000  1168.233561    71.000000  
In [ ]:
## handle DEBTINC
median_imputer = SimpleImputer(strategy='median')
debtinc_reshaped = data['DEBTINC'].values.reshape(-1, 1)
data['DEBTINC'] = median_imputer.fit_transform(debtinc_reshaped)
In [ ]:
## create missing category for categorical data
for c in categorical_columns:
    data[c].fillna('Unknown', inplace=True)
In [ ]:
msno.matrix(data)
plt.show()
In [ ]:
# Display summary statistics for the data
display("Median Imputed DataFrame Summary Statistics:", data.describe())
'Median Imputed DataFrame Summary Statistics:'
BAD LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
count 5960.000000 5960.000000 5960.000000 5960.000000 5960.000000 5960.000000 5960.000000 5960.000000 5960.000000 5960.000000 5960.000000
mean 0.199497 18607.969799 72849.947455 101265.043217 8.907342 0.258066 0.479808 179.212184 1.152833 21.225570 34.000651
std 0.399656 11207.480417 43153.211045 57256.005044 7.371312 0.800957 1.094570 84.143410 1.665734 10.005451 7.644528
min 0.000000 1100.000000 2063.000000 8000.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.524499
25% 0.000000 11100.000000 46431.250000 65683.500000 3.000000 0.000000 0.000000 116.761588 0.000000 15.000000 30.763159
50% 0.000000 16300.000000 64373.400000 88605.000000 7.000000 0.000000 0.000000 172.735145 1.000000 20.000000 34.818262
75% 0.000000 23300.000000 89939.000000 119229.000000 13.000000 0.000000 0.643752 228.041251 2.000000 26.000000 37.949892
max 1.000000 89900.000000 399550.000000 855909.000000 41.000000 10.000000 15.000000 1168.233561 17.000000 71.000000 203.312149
  • Observations from Summary Statistics
In [ ]:
## distribution of the completed dataset
for column in numerical_columns:
    plot_distribution_and_boxplot(data, column)

The distributions are not noticeably affected by filling in the missing data.

Answering the leading questions for Univariate Analysis¶

  1. What is the range of values for the loan amount variable "LOAN"?

  2. How does the distribution of years at present job "YOJ" vary across the dataset?

  3. How many unique categories are there in the REASON variable?

  4. What is the most common category in the JOB variable?

In [ ]:
# getting the minimum and maximum loan amounts
loan_min = data['LOAN'].min()
loan_max = data['LOAN'].max()

loan_range = loan_max - loan_min

print(f"Minimum Loan Amount: ${loan_min}")
print(f"Maximum Loan Amount: ${loan_max}")
print(f"Range of Loan Amounts: ${loan_range}")
Minimum Loan Amount: $1100.0
Maximum Loan Amount: $89900.0
Range of Loan Amounts: $88800.0

How does the distribution of years at present job "YOJ" vary across the dataset?

Average Tenure: The average job tenure is approximately 8.91 years, indicating a moderately long period spent by individuals in their current jobs. However, the median tenure of 7 years suggests that half of the dataset's individuals have stayed in their jobs for a shorter duration, pointing towards a majority with relatively recent job changes.

Variability: There's a wide range in job tenure, from 0 to 41 years, with a standard deviation of about 7.37 years. This indicates diverse job tenure experiences among the individuals in the dataset.

Distribution: The distribution of job tenures is right-skewed, as evidenced by a mean that is higher than the median and the presence of individuals with very long tenures that extend up to 41 years. This skewness suggests that while many individuals have shorter tenures, there's a significant number with exceptionally long tenures, pulling the average higher.

Quartile Insights: A quarter of the dataset's individuals have been in their current jobs for 3 years or less, while 75% have tenures of 13 years or less. This quartile distribution further underscores the skewness towards shorter job tenures within the dataset.
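The right skew described above can be checked numerically (mean above median, positive skewness coefficient). A small sketch on a synthetic right-skewed tenure sample; in the notebook, `data['YOJ']` would replace the stand-in series:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed tenure sample standing in for data['YOJ']
yoj = pd.Series(
    np.random.default_rng(42).gamma(shape=1.5, scale=6.0, size=5000)
).clip(0, 41)

summary = {
    'mean': yoj.mean(),
    'median': yoj.median(),
    'skew': yoj.skew(),   # > 0 indicates a right-skewed distribution
}
```

A mean above the median together with a positive skew coefficient is the numerical signature of the pattern seen in the YOJ histogram.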

How many unique categories are there in the REASON variable?

Three:

HomeImp ("Home Improvement"), DebtCon ("Debt Consolidation"), and "Unknown" (the category we created for missing values).

What is the most common category in the JOB variable? "Other".

Bivariate Analysis¶

Leading questions

Is there a relationship between the REASON variable and the proportion of applicants who defaulted on their loan?

Do applicants who default have a significantly different loan amount compared to those who repay their loan?

Is there a correlation between the value of the property and the loan default rate?

Do applicants who default have a significantly different mortgage amount compared to those who repay their loan?

In [ ]:
# Is there a relationship between the REASON variable and the proportion of applicants who defaulted on their loan?
default_proportions = data.groupby('REASON')['BAD'].mean().reset_index()

plt.figure(figsize=(8, 6))
sns.barplot(x='REASON', y='BAD', hue='REASON', data=default_proportions, palette='coolwarm', legend=False)
plt.xlabel('Reason for Loan')
plt.ylabel('Proportion of Defaults')
plt.title('Proportion of Loan Defaults by Reason')
plt.xticks(rotation=45)  # Rotate category labels for better readability
plt.show()

Home improvement loans show a somewhat higher proportion of defaults, but the difference is small and both categories behave similarly.

In [ ]:
#Do applicants who default have a significantly different loan amount compared to those who repay their loan?

plt.figure(figsize=(10, 6))
sns.boxplot(x='BAD', y='LOAN', hue='BAD', data=data, palette='coolwarm', notch=True, width=0.5, legend=False)
plt.xticks([0, 1], ['Repaid', 'Defaulted'])  # Setting custom labels for clarity
plt.title('Loan Amount Distribution by Repayment Status')
plt.ylabel('Loan Amount')
plt.xlabel('Status')
plt.show()

The data analysis does not provide conclusive evidence on whether the loan amount affects the likelihood of repayment versus default. Both groups exhibit similar central values (the notched medians overlap) and both contain outliers, so we cannot confidently conclude that loan amount is a determining factor for loan repayment or default.
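The visual comparison above can be formalized with a nonparametric test of whether the two loan-amount distributions differ. A self-contained sketch; with the notebook's DataFrame, `repaid` and `defaulted` would be `data.loc[data['BAD'] == 0, 'LOAN']` and `data.loc[data['BAD'] == 1, 'LOAN']`, and synthetic stand-ins are used here only so the snippet runs on its own:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic stand-ins for the two LOAN groups (hypothetical values)
rng = np.random.default_rng(1)
repaid = rng.lognormal(mean=9.7, sigma=0.5, size=900)
defaulted = rng.lognormal(mean=9.65, sigma=0.55, size=260)

# Mann-Whitney U: robust to the skew and outliers visible in the boxplots
stat, p_value = mannwhitneyu(repaid, defaulted, alternative='two-sided')
significant = p_value < 0.05
```

A non-significant p-value here would support the reading that loan amount alone does not separate the two groups.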

In [ ]:
#Is there a correlation between the value of the property and the loan default rate?

correlation_coefficient = data['VALUE'].corr(data['BAD'])
print(f"Correlation coefficient between VALUE and BAD: {correlation_coefficient}")

plt.figure(figsize=(10, 6))
sns.scatterplot(x='VALUE', y='BAD', data=data, hue='BAD', palette='coolwarm', alpha=0.6)
plt.title('Correlation between Property Value and Loan Default Rate')
plt.xlabel('Property Value')
plt.ylabel('Loan Default Status')
plt.yticks([0, 1], ['Repaid', 'Defaulted'])  # Adjusting y-ticks for clarity
plt.show()
Correlation coefficient between VALUE and BAD: -0.04547284494145901

The scatter plot of "VALUE" against "BAD" corresponds to a correlation coefficient of approximately -0.045. Despite this weak linear correlation, the visualization shows some clustering of property values within both outcome groups.

These clusters do not imply a strong linear relationship between "VALUE" and repayment status; they may instead point to ranges of "VALUE" that are common to both paid and defaulted loans. Analyses beyond correlation coefficients, such as examining distributions or exploring nonlinear relationships, may yield deeper insight into the association between property value and repayment status.
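One concrete way to follow up on the suggestion above is to bin property value and compare the default rate per bin, which can reveal nonlinear patterns that a single correlation coefficient hides. A sketch with synthetic stand-in data (in the notebook this would operate directly on `data[['VALUE', 'BAD']]`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data[['VALUE', 'BAD']]
rng = np.random.default_rng(7)
df = pd.DataFrame({
    'VALUE': rng.lognormal(mean=11.4, sigma=0.5, size=3000),
})
df['BAD'] = (rng.random(3000) < 0.2).astype(int)

# Quintile bins of property value, then the default rate within each bin
df['value_bin'] = pd.qcut(df['VALUE'], q=5)
default_rate = df.groupby('value_bin', observed=True)['BAD'].mean()
```

A non-monotone pattern across the bins would indicate a relationship that Pearson correlation cannot capture.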

In [ ]:
#Do applicants who default have a significantly different mortgage amount compared to those who repay their loan?

plt.figure(figsize=(10, 6))
sns.boxplot(x='BAD', y='MORTDUE', hue='BAD', data=data, palette='coolwarm', legend=False)
plt.xticks([0, 1], ['Repaid', 'Defaulted'])
plt.title('Mortgage Amount Distribution by Loan Repayment Status')
plt.xlabel('Loan Status')
plt.ylabel('Mortgage Amount')
plt.show()

The boxplot shows that defaulted loans are not clearly distinguishable from paid loans; the two mortgage-amount distributions are highly similar. Despite subtle clustering within specific ranges of "MORTDUE" for both groups, the overall shapes largely overlap. This underscores the difficulty of predicting repayment status from the outstanding mortgage amount alone; additional factors are needed to explain loan default.

In [ ]:
def dist_boxplot(x, **kwargs):
    ax = sns.histplot(x, kde=False, color="skyblue", alpha=0.6, bins=30)

    # Creating a twin axis to overlay a boxplot
    ax2 = ax.twinx()
    sns.boxplot(x=x, ax=ax2, width=0.5, fliersize=2)
    ax2.set(ylim=(-5, 5))

    # Making boxplot transparent background
    ax2.set_zorder(1)
    ax2.patch.set_visible(False)


# For 'CLAGE'
g = sns.FacetGrid(data, col="BAD")
g.map(dist_boxplot, 'CLAGE')

# For 'NINQ'
g = sns.FacetGrid(data, col="BAD")
g.map(dist_boxplot, 'NINQ')

# For 'CLNO'
g = sns.FacetGrid(data, col="BAD")
g.map(dist_boxplot, 'CLNO')

# For 'DEBTINC'
g = sns.FacetGrid(data, col="BAD")
g.map(dist_boxplot, 'DEBTINC')

plt.show()

From the plots several conclusions can be drawn:

  1. CLAGE (Age of Credit Line):

    • The distribution of CLAGE for both paid and defaulted loans appears similar, with overlapping boxplots.
    • There are outliers present in both distributions, indicating extreme values of CLAGE for both loan repayment statuses.
  2. NINQ (Number of Inquiries):

    • The distributions of NINQ for paid and defaulted loans exhibit notable differences.
    • Defaulted loans tend to have a higher number of inquiries (NINQ) compared to paid loans, as evidenced by the higher median and larger spread of values.
    • Defaulted loans also display more outliers, suggesting a wider range of NINQ values associated with loan defaults.
  3. CLNO (Number of Credit Lines):

    • Similar to CLAGE, the distributions of CLNO for both paid and defaulted loans are quite similar, with overlapping boxplots.
    • Outliers are present in both distributions, indicating extreme values of CLNO for both loan repayment statuses.
  4. DEBTINC (Debt-to-Income Ratio):

    • The distributions of DEBTINC for paid and defaulted loans display noticeable differences.
    • Defaulted loans tend to have higher debt-to-income ratios (DEBTINC) compared to paid loans, as evidenced by the higher median and larger spread of values.
    • Defaulted loans also exhibit more outliers, suggesting a wider range of DEBTINC values associated with loan defaults.
In [ ]:
# Assuming 'data' is your DataFrame
sns.set_style("whitegrid")

# Count the observations for each category of 'BAD'
observation_counts = data['BAD'].value_counts().sort_index()

# Create a figure to hold the subplots
plt.figure(figsize=(14, 6))

# Plot for DEROG
plt.subplot(1, 2, 1)  # 1 row, 2 columns, 1st subplot
sns.boxplot(x='BAD', y='DEROG', data=data)
# Adding observation count to the title
plt.title(f'Derogatory Reports by Loan Default Status\nCounts: {observation_counts.to_string()}')

# Plot for DELINQ
plt.subplot(1, 2, 2)  # 1 row, 2 columns, 2nd subplot
sns.boxplot(x='BAD', y='DELINQ', data=data)
# Adding observation count to the title
plt.title(f'Delinquent Credit Lines by Loan Default Status\nCounts: {observation_counts.to_string()}')

plt.tight_layout()
plt.show()

Borrowers who defaulted on their mortgage have more delinquent credit lines and major derogatory reports than those who did not.

In [ ]:
sns.set_style("whitegrid")

# Create a figure to hold the subplots
plt.figure(figsize=(14, 6))

# Count the observations for each category of 'BAD'
observation_counts = data['BAD'].value_counts()

# Plot for DEBTINC
plt.subplot(1, 2, 1)  # 1 row, 2 columns, 1st subplot
sns.boxplot(x='BAD', y='DEBTINC', data=data)
# Adding observation count to the title
plt.title(f'Debt-to-Income Ratio by Loan Default Status\nCounts: {observation_counts[0]} (BAD=0), {observation_counts[1]} (BAD=1)')

# Plot for LOAN
plt.subplot(1, 2, 2)  # 1 row, 2 columns, 2nd subplot
sns.boxplot(x='BAD', y='LOAN', data=data)
# Adding observation count to the title
plt.title(f'Loan Request Amount by Loan Default Status\nCounts: {observation_counts[0]} (BAD=0), {observation_counts[1]} (BAD=1)')

plt.tight_layout()
plt.show()

Debt-to-Income Ratio (DEBTINC) by Loan Default Status:

  • The boxplot shows that defaulted loans (BAD=1) tend to have higher debt-to-income ratios compared to paid loans (BAD=0).
  • Defaulted loans exhibit a wider spread of debt-to-income ratios, as indicated by the larger interquartile range (IQR) and the presence of more outliers.
  • This suggests that higher debt-to-income ratios may be associated with an increased likelihood of loan default.

Loan Request Amount (LOAN) by Loan Default Status:

  • The boxplot reveals a potential difference in loan request amounts between paid and defaulted loans.
  • Defaulted loans (BAD=1) appear to have slightly lower median loan amounts compared to paid loans (BAD=0).
  • However, both distributions exhibit considerable variability, with overlapping interquartile ranges and a similar number of outliers.
  • This suggests that while there may be some differences in loan amounts between paid and defaulted loans, other factors likely contribute to loan default beyond loan size alone.
In [ ]:
plt.figure(figsize=(8, 6))
observation_counts = data['BAD'].value_counts()
sns.boxplot(x='BAD', y='CLAGE', data=data)
plt.title(f'Age of Oldest Credit Line by Loan Default Status\nCounts: {observation_counts[0]} (BAD=0), {observation_counts[1]} (BAD=1)')

plt.xlabel('Loan Default Status')
plt.ylabel('Age of Oldest Credit Line (Months)')

plt.show()

Intuitively, borrowers who repaid their loans appear more creditworthy to the bank: they tend to have older credit lines (higher CLAGE) and fewer recent credit inquiries and delinquencies than defaulting customers.

Consistent with the earlier boxplots, defaulting borrowers show higher debt-to-income ratios, while loan request amounts are broadly similar across the two groups, suggesting loan size alone does not drive default.

Multivariable analysis¶

In [ ]:
correlation_matrix = data.corr(numeric_only=True)

# Plotting heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', cbar=True)
plt.show()

Relationship with "BAD" (Loan Default Indicator)

DEROG (Number of Major Derogatory Reports) and DELINQ (Number of Delinquent Credit Lines) have the most substantial positive correlations with "BAD" at 0.26 and 0.33, respectively. This suggests that applicants with more derogatory reports and delinquencies are more likely to default on their loans.

CLAGE (Age of Oldest Credit Line in Months) shows a significant negative correlation (-0.16) with "BAD", indicating that longer credit histories are associated with a lower likelihood of default.

NINQ (Number of Recent Credit Inquiries) has a moderately positive correlation (0.17) with "BAD", suggesting that a higher number of recent credit inquiries might be associated with a higher risk of default, possibly reflecting financial stress or overextension.

DEBTINC (Debt-to-Income Ratio) also shows a positive correlation (0.15) with "BAD", indicating that higher debt relative to income might increase the likelihood of loan default.

Other Notable Correlations

MORTDUE (Amount Owed on Mortgage) and VALUE (Current Value of the Property) are highly correlated (0.78), as expected, since larger mortgages are typically associated with more valuable properties.

MORTDUE and CLNO (Number of Credit Lines) show a significant positive correlation (0.32), suggesting that individuals with higher mortgage amounts also tend to have more credit lines, possibly reflecting higher overall credit engagement or financial activity.

CLAGE shows a positive correlation with YOJ (Years at Present Job) (0.18), indicating that individuals with longer employment tenure also tend to have older credit lines, which could reflect overall financial stability.

Insights for Predictive Modeling

Variables like DEROG, DELINQ, NINQ, and DEBTINC could be key predictors in a model aimed at predicting loan defaults, given their significant correlations with "BAD".

The negative correlation of CLAGE with "BAD" highlights the importance of considering the age of credit history as a potential protective factor against default.

The lack of a strong correlation between LOAN size and default risk ("BAD") suggests that the amount of the loan itself is not as predictive of default as the borrower's credit history and current financial obligations.
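The feature-ranking reading of the heatmap can be made explicit by sorting absolute correlations with "BAD"; in the notebook, `data.corr(numeric_only=True)['BAD']` would be used directly. A sketch with hypothetical stand-in columns, where `DELINQ` is constructed to carry the strongest signal and `LOAN` none:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: DELINQ influences the target most, CLAGE less, LOAN not at all
rng = np.random.default_rng(3)
demo = pd.DataFrame(rng.normal(size=(500, 3)),
                    columns=['DELINQ', 'CLAGE', 'LOAN'])
demo['BAD'] = (0.5 * demo['DELINQ'] - 0.3 * demo['CLAGE']
               + rng.normal(size=500) > 0).astype(int)

# Rank features by absolute correlation with the target
ranked = (demo.corr(numeric_only=True)['BAD']
          .drop('BAD').abs().sort_values(ascending=False))
```

Applied to the real data, this ordering is a quick first pass at feature selection before fitting models.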

Treating Outliers¶

Given our dataset's inherent right skewness and presence of outliers, it becomes imperative to normalize the data for effective analysis. To achieve this, we employed the np.log1p transformation as it aligns well with the characteristics of our data. This transformation not only addresses the skewness but also mitigates the impact of outliers, rendering the data more suitable for subsequent analysis.

In [ ]:
# Create a new DataFrame for the log-transformed data
log_transformed_data = data.copy()

# Apply log transformation to all numerical columns
# Using np.log1p to handle columns with zeros by computing log(1 + x) for each element
for col in numerical_columns:
    log_transformed_data[col] = np.log1p(log_transformed_data[col])

log_transformed_data[numerical_columns].head()
Out[ ]:
LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
0 7.003974 10.160491 10.571983 2.442347 0.000000 0.000000 4.557729 0.693147 2.302585 3.578458
1 7.170888 11.157022 11.133143 2.079442 0.000000 1.098612 4.810828 0.000000 2.708050 3.578458
2 7.313887 9.510519 9.723224 1.609438 0.000000 0.000000 5.013742 0.693147 2.397895 3.578458
3 7.313887 10.780655 10.934745 2.054124 0.277259 0.439445 4.718258 0.415888 2.610070 3.578458
4 7.438972 11.490690 11.626263 1.386294 0.000000 0.000000 4.546835 0.000000 2.708050 3.578458
In [ ]:
for column in numerical_columns:
    plot_distribution_and_boxplot(log_transformed_data, column)

Important Insights from EDA¶

The exploratory data analysis (EDA) reveals several significant insights into factors influencing loan default risk. Derived from the correlation matrix, variables like DEROG, DELINQ, NINQ, and DEBTINC exhibit notable correlations with loan default ("BAD"), indicating that applicants with more derogatory reports, delinquent credit lines, recent credit inquiries, and higher debt-to-income ratios are more likely to default on loans. Conversely, CLAGE (Age of Oldest Credit Line) shows a negative correlation with default, suggesting that longer credit histories may mitigate default risk. Notably, loan size (LOAN) demonstrates a weaker correlation with default, indicating that other factors such as credit history and financial obligations play significant roles in predicting default. These findings underscore the importance of incorporating multiple variables, including credit history, recent financial behavior, and debt levels, in predictive modeling to accurately assess loan default risk.

Model Building - Approach¶

  • Data preparation
  • Partition the data into train and test set
  • Build the model
  • Fit on the train data
  • Tune the model
  • Test the model on test set
In [ ]:
## prepare data

X = log_transformed_data.drop('BAD', axis=1)
y = log_transformed_data['BAD']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# due to several outliers after normalization, lets winsorize the data for logistic regression
X_train_winsorized = X_train.copy()
for column in numerical_columns:
    X_train_winsorized[column] = winsorize(X_train_winsorized[column], limits=[0.05, 0.05])


# Adjusted Preprocessing for Categorical Data
# Define the ColumnTransformer for categorical data only, since numerical data doesn't require further preprocessing
preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_columns)
], remainder='passthrough')  # 'remainder=passthrough' to keep numerical columns without transformation

Logistic Regression¶

In [ ]:
# Define the model
logistic_regression_model = LogisticRegression(max_iter=1000, random_state=42)
model_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                 ('model', logistic_regression_model)])
model_pipeline.fit(X_train_winsorized, y_train)
Out[ ]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['JOB', 'REASON'])])),
                ('model', LogisticRegression(max_iter=1000, random_state=42))])
In [ ]:
# Predictions
y_pred = model_pipeline.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
Accuracy: 0.8196308724832215
Confusion Matrix:
 [[873  54]
 [161 104]]
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.94      0.89       927
           1       0.66      0.39      0.49       265

    accuracy                           0.82      1192
   macro avg       0.75      0.67      0.69      1192
weighted avg       0.80      0.82      0.80      1192

The logistic regression model is achieving an accuracy rate of approximately 81.96%. The model excelled in identifying loans that were paid off (class 0), with a precision of 84% and a high recall of 94%, yielding an F1-score of 89%. This indicates a strong capability in accurately predicting loans that would be repaid. However, the model's performance in detecting defaulted loans (class 1) was less effective, evidenced by a recall of 39% and precision of 66%, leading to a modest F1-score of 49%. These results highlight a significant challenge in correctly identifying default cases, which is critical for risk assessment and mitigation in financial lending.
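Before switching model families, one common lever for the weak class-1 recall is class weighting, which penalizes misclassified defaults more heavily during training. A self-contained sketch on a synthetic imbalanced problem (in this notebook it would simply mean passing `class_weight='balanced'` to the existing `LogisticRegression` in the pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: ~20% minority class, mimicking the BAD imbalance
X, y = make_classification(n_samples=3000, weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000,
                              class_weight='balanced').fit(X_tr, y_tr)

# Weighting typically raises minority-class recall at some cost to precision
recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
```

The trade-off is deliberate: for default risk, missing a defaulter is usually costlier than flagging a good borrower.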

Let's try another model.

Decision Tree¶

In [ ]:
decision_tree_model = DecisionTreeClassifier(random_state=42, class_weight={0: 0.3, 1: 0.7})  # weight class 1 (defaults) more heavily
model_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                 ('model', decision_tree_model)])
In [ ]:
model_pipeline.fit(X_train, y_train)
Out[ ]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['JOB', 'REASON'])])),
                ('model',
                 DecisionTreeClassifier(class_weight={0: 0.3, 1: 0.7},
                                        random_state=42))])
In [ ]:
y_pred = model_pipeline.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
Accuracy: 0.8691275167785235
Confusion Matrix:
 [[867  60]
 [ 96 169]]
Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.94      0.92       927
           1       0.74      0.64      0.68       265

    accuracy                           0.87      1192
   macro avg       0.82      0.79      0.80      1192
weighted avg       0.86      0.87      0.87      1192

The Decision Tree model achieved an impressive accuracy of approximately 87%. The model demonstrated strong performance in identifying loans that were successfully repaid (class 0), with a precision of 90% and a recall of 94%, resulting in an F1-score of 92%. This indicates a high reliability in predicting non-default cases. For defaulted loans (class 1), the model also showed commendable results with a precision of 74%, a recall of 64%, and an F1-score of 68%, suggesting a solid ability to identify default cases, albeit with room for improvement. The overall model accuracy and the detailed performance metrics underscore the Decision Tree's effectiveness in distinguishing between paid and defaulted loans. The weighted average F1-score of 87% reflects the model's robustness across both classes. However, the slightly lower recall for class 1 highlights an area for potential enhancement, aiming to better capture defaulted loans without significantly compromising the precision.

Let's see if we can get better results by tuning the tree.

In [ ]:
decision_tree_model.get_depth()
Out[ ]:
24
In [ ]:
def evaluate_model_depth_with_test(X_train, y_train, X_test, y_test, depth_range, cv_folds=5):
    """Evaluates Decision Tree model over a range of depths using both cross-validation on the training set
    and evaluation on the test set.

    Args:
        X_train (DataFrame): Training features.
        y_train (Series): Training target variable.
        X_test (DataFrame): Test features.
        y_test (Series): Test target variable.
        depth_range (range): Range of depths to evaluate.
        cv_folds (int): Number of folds for cross-validation.

    Returns:
        DataFrame: Contains average cross-validated scores for each depth and test set scores.
    """
    cv_scores = []
    test_errors = []

    for depth in depth_range:
        clf = DecisionTreeClassifier(max_depth=depth, random_state=42, class_weight={0: 0.3, 1: 0.7})

        model_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', clf)])

        # Cross-validation on training data
        scores = cross_val_score(model_pipeline, X_train, y_train, cv=cv_folds, scoring='neg_mean_squared_error')  # for 0/1 labels, MSE equals the misclassification rate
        cv_scores.append(np.mean(scores))

        model_pipeline.fit(X_train, y_train)
        y_test_pred = model_pipeline.predict(X_test)
        test_error = mean_squared_error(y_test, y_test_pred)
        test_errors.append(test_error)

    return pd.DataFrame({
        'Max Depth': list(depth_range),
        'CV Score': cv_scores,
        'Test Error': test_errors
    })


depth_range = range(1, 24)


results = evaluate_model_depth_with_test(X_train, y_train, X_test, y_test, depth_range)

# Convert the negative-MSE CV scores back to positive errors (lower is better for both columns)
results['CV Misclassification Error'] = -results['CV Score']
results['Test Misclassification Error'] = results['Test Error']

# Plotting
plt.figure(figsize=(12, 8))
plt.plot(results['Max Depth'], results['CV Misclassification Error'], marker='o', linestyle='-', color='b', label='CV Misclassification Error')
plt.plot(results['Max Depth'], results['Test Misclassification Error'], marker='s', linestyle='--', color='r', label='Test Misclassification Error')
plt.title('Max Depth vs. Misclassification Error')
plt.xlabel('Max Depth')
plt.ylabel('Misclassification Error')
plt.grid(True)
plt.xticks(depth_range)
plt.legend()
plt.show()

The test error is minimized at a depth of around 11, where the cross-validated error is also near its lowest. This depth represents a good balance between model complexity and generalization ability.
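Depth is only one complexity knob; scikit-learn's minimal cost-complexity pruning (`ccp_alpha`) is an alternative that grows a full tree and then prunes it back. A minimal sketch on synthetic data (the dataset is illustrative, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Candidate alphas come from the tree's own pruning path
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_tr, y_tr)

scores = {}
for alpha in path.ccp_alphas[:-1]:  # the last alpha prunes down to the root
    t = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha).fit(X_tr, y_tr)
    scores[alpha] = t.score(X_te, y_te)

best_alpha = max(scores, key=scores.get)  # alpha with highest held-out accuracy
print(best_alpha >= 0.0)
```

In practice `ccp_alpha` would be tuned by cross-validation (e.g. added to the grid below) rather than selected on the test set as in this sketch.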

Decision Tree - Hyperparameter Tuning¶

  • Hyperparameter tuning is tricky in the sense that there is no direct way to calculate how a change in the hyperparameter value will reduce the loss of your model, so we usually resort to experimentation. We'll use Grid search to perform hyperparameter tuning.
  • Grid search is a tuning technique that attempts to compute the optimum values of hyperparameters.
  • It is an exhaustive search that is performed on the specific parameter values of a model.
  • The parameters of the estimator/model used to apply these methods are optimized by cross-validated grid-search over a parameter grid.

criterion {"gini", "entropy"}

The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.

max_depth

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_leaf

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

You can learn about more hyperparameters at the link below and try tuning them.

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
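The two split criteria can be computed directly. A small illustration of Gini impurity versus entropy for a node's class distribution (these helper functions are for illustration; scikit-learn computes them internally):

```python
import numpy as np

def gini(p):
    """Gini impurity for a class-probability vector p: 1 - sum(p_k^2)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy (bits) for a class-probability vector p: -sum(p_k * log2 p_k)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return -np.sum(p * np.log2(p))

print(round(gini([0.5, 0.5]), 3))     # 0.5 — maximally impure binary node
print(round(entropy([0.5, 0.5]), 3))  # 1.0 — maximal entropy for two classes
print(gini([1.0, 0.0]))               # 0.0 — pure node, nothing left to split
```

Both criteria peak at a 50/50 mix and vanish for a pure node; in practice they usually select very similar splits, with Gini being slightly cheaper to compute.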

In [ ]:
param_grid = {
    'model__max_depth': [9, 10, 11, 12, 13, 14],
    'model__min_samples_split': [2, 5, 10, 20],
    'model__min_samples_leaf': [1, 2, 4, 6, 10],
    'model__class_weight': [{0: 0.3, 1: 0.7}, 'balanced'],
    }
grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, scoring='accuracy', verbose=1, n_jobs=-1)
In [ ]:
grid_search.fit(X_train, y_train)
Fitting 5 folds for each of 240 candidates, totalling 1200 fits
Out[ ]:
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('cat',
                                                                         OneHotEncoder(handle_unknown='ignore'),
                                                                         ['JOB',
                                                                          'REASON'])])),
                                       ('model',
                                        DecisionTreeClassifier(class_weight={0: 0.2,
                                                                             1: 0.8},
                                                               random_state=42))]),
             n_jobs=-1,
             param_grid={'model__class_weight': [{0: 0.3, 1: 0.7}, 'balanced'],
                         'model__max_depth': [9, 10, 11, 12, 13, 14],
                         'model__min_samples_leaf': [1, 2, 4, 6, 10],
                         'model__min_samples_split': [2, 5, 10, 20]},
             scoring='accuracy', verbose=1)
In [ ]:
# Best hyperparameters
print("Best hyperparameters:\n", grid_search.best_params_)

# Best model's score
print("Best model's accuracy:", grid_search.best_score_)


best_model = grid_search.best_estimator_
Best hyperparameters:
 {'model__class_weight': {0: 0.3, 1: 0.7}, 'model__max_depth': 9, 'model__min_samples_leaf': 1, 'model__min_samples_split': 2}
Best model's accuracy: 0.8794047705469431
In [ ]:
# Predictions with the best model
y_pred = best_model.predict(X_test)

# Evaluation
print("Accuracy on test set:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
Accuracy on test set: 0.8808724832214765
Confusion Matrix:
 [[858  69]
 [ 73 192]]
Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.93      0.92       927
           1       0.74      0.72      0.73       265

    accuracy                           0.88      1192
   macro avg       0.83      0.83      0.83      1192
weighted avg       0.88      0.88      0.88      1192

The hyperparameter-tuned Decision Tree model, optimized via grid search, demonstrated a notable increase in performance, achieving an accuracy of 88.09% on the test set, up from approximately 86.91% with the initial model. This refined model significantly improved in detecting defaulted loans (class 1), with recall rising from 64% to 72% and precision at 74%, indicating a stronger ability to identify defaults accurately. The overall balanced performance across classes also improved, as evidenced by the increase in macro average F1-score from 80% to 83%. This optimization underscores the effectiveness of hyperparameter tuning in enhancing model accuracy and balance, particularly in addressing class imbalance and improving default detection, making it a more reliable tool for financial risk mitigation.

In [ ]:
decision_tree_model = best_model.named_steps['model']
In [ ]:
def get_transformed_feature_names(column_transformer, input_features):
    """
    Get feature names from a ColumnTransformer that includes one-hot encoding and passthrough transformers.

    Args:
    - column_transformer: The ColumnTransformer instance.
    - input_features: The original feature names (as a list).

    Returns:
    - A list of the transformed feature names.
    """
    transformed_feature_names = []

    for transformer in column_transformer.transformers_:
        transformer_name, transformer_instance, columns = transformer

        # Handling for OneHotEncoder
        if hasattr(transformer_instance, 'get_feature_names_out'):

            if columns is not None and transformer_name != 'remainder':
                transformed_feature_names.extend(transformer_instance.get_feature_names_out(columns))
            else:
                transformed_feature_names.extend(transformer_instance.get_feature_names_out())

        elif transformer_name == 'remainder' and transformer_instance == 'passthrough':
            remainder_columns = [input_features[i] for i in columns] if isinstance(columns, list) else input_features
            transformed_feature_names.extend(remainder_columns)

    return transformed_feature_names


input_features = list(X_train.columns)
transformed_feature_names = get_transformed_feature_names(preprocessor, input_features)

plt.figure(figsize=(40, 40))
plot_tree(decision_tree_model, feature_names=transformed_feature_names, filled=True, class_names=['Paid', 'Defaulted'], max_depth=14, fontsize=10, rounded=True, precision=2, proportion=False)
plt.title('Decision Tree for Loan Default Prediction', fontsize=20)
#plt.savefig('tree_high_res.png', dpi=300, bbox_inches='tight')  # Save to file
plt.show()

Key insights from the decision tree:

  1. Primary Split on Debt-to-Income Ratio (DEBTINC): The tree's initial split is based on the DEBTINC feature, indicating its significant role in predicting loan outcomes. A DEBTINC threshold of 3.58 separates the data, suggesting that lower debt-to-income ratios are associated with a higher likelihood of loan repayment.

  2. Importance of Delinquency (DELINQ) and Number of Inquiries (NINQ): Following DEBTINC, the tree frequently utilizes DELINQ and NINQ for further splits. This underscores the importance of past payment behavior and recent credit inquiries in assessing loan default risk.

  3. Role of Loan Reasons (REASON_HomeImp): The decision to split based on whether the loan is for home improvement (REASON_HomeImp) indicates that the purpose of the loan has predictive value, with different default risks associated with home improvement loans versus other reasons.

  4. Value of Collateral (VALUE) and Number of Credit Lines (CLNO): These features appear in various splits, suggesting that the collateral's value and the borrower's existing credit lines are relevant factors in predicting loan performance.

  5. Interaction of Features: The tree structure reveals complex interactions between features. For example, within certain ranges of DEBTINC and DELINQ, other variables like VALUE, NINQ, and CLNO come into play, indicating that the risk of default is contingent on multiple factors.

  6. Thresholds for Different Features: The specific thresholds used for splits (e.g., DELINQ <= 1.05, VALUE > 10.08) provide insights into critical values that distinguish between likely loan repayment and default. These thresholds can inform risk assessment strategies.

  7. Subgroup Specific Patterns: The tree highlights specific patterns for subgroups within the data. For instance, within loans with a particular DEBTINC and DELINQ profile, further distinctions based on MORTDUE, CLAGE (age of oldest credit line), and other variables indicate nuanced patterns of risk.

  8. Sensitivity to Certain Conditions: The model's branches reflect sensitivity to certain conditions, like higher DEBTINC combined with specific levels of DEROG (derogatory reports) and DELINQ, significantly influencing the prediction of default.

Comparing this decision tree's structure and insights with the initial, unoptimized model reveals the impact of hyperparameter tuning, particularly in terms of identifying meaningful splits and interactions that might not have been as clearly captured previously. The tuned model likely offers a more nuanced understanding of factors influencing loan defaults, enhancing its predictive accuracy and providing a solid basis for informed decision-making in loan approval processes.
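The split rules visualized above can also be read off in plain text with scikit-learn's `export_text`, which is often easier to audit than a large plot. A minimal sketch on synthetic data (feature names `f0`..`f3` are placeholders, not the notebook's columns):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=4, random_state=42)
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# Nested if/else rules with the threshold used at each split
rules = export_text(tree, feature_names=[f"f{i}" for i in range(4)])
print(rules)
```

In the notebook this would be called with the fitted `decision_tree_model` and `transformed_feature_names`, giving reviewers a textual audit trail for each rejection path, which supports the transparency goals of the Equal Credit Opportunity Act.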

In [ ]:
transformed_feature_names = best_model.named_steps['preprocessor'].get_feature_names_out()
importances = decision_tree_model.feature_importances_


feature_importances = pd.DataFrame({'Feature': transformed_feature_names, 'Importance': importances})

# Sort the DataFrame to display the most important features at the top
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)


plt.figure(figsize=(10, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importances.head(20))  # Display top 20 for clarity
plt.title('Feature Importance')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()
  1. Dominance of Debt-to-Income Ratio (DEBTINC): With a feature importance score of approximately 64.76%, the debt-to-income ratio is by far the most influential factor in predicting loan outcomes. This underscores the critical role of borrowers' financial health and their ability to manage existing debt relative to their income in determining loan repayment capabilities.

  2. Significance of Payment History and Credit Age: Following DEBTINC, the delinquency (DELINQ) and age of the oldest credit line (CLAGE) are the next most important features, with importance scores around 5.91% and 5.75%, respectively. This highlights the importance of borrowers' payment history and the maturity of their credit history in assessing default risk.

  3. Relevance of Employment and Loan Attributes: Features like years at the current job (YOJ), the number of credit lines (CLNO), derogatory reports (DEROG), property value (VALUE), loan amount (LOAN), and mortgage due (MORTDUE) also play significant roles, albeit to a lesser extent compared to DEBTINC, DELINQ, and CLAGE. These factors collectively capture aspects of borrowers' stability, credit behavior, and collateral value.

  4. Limited Influence of Job Type and Loan Reason: Specific job categories (e.g., Self, Office, Mgr) and the reason for the loan (Home Improvement) have much lower importance scores, indicating a relatively minor direct influence on loan default predictions in this model. Notably, some job types (ProfExe, Sales, Other) and one loan reason (DebtCon) show zero importance, suggesting these features do not contribute to the model's decision-making process in this dataset.

  5. Non-uniform Distribution of Feature Importance: The distribution of importance scores is highly skewed towards a few key features, particularly DEBTINC, emphasizing the model's reliance on a subset of highly predictive attributes over others.
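Impurity-based importances like the ones above are known to be biased toward high-cardinality continuous features, so a common cross-check is permutation importance, which measures the drop in held-out score when each column is shuffled. A minimal sketch on synthetic data (the dataset is illustrative, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 10 times and record the mean drop in test accuracy
result = permutation_importance(tree, X_te, y_te, n_repeats=10, random_state=42)
print(result.importances_mean.shape)  # one mean importance per feature: (6,)
```

In the notebook, passing the fitted `best_model` pipeline and the raw `X_test` would attribute importance to the original columns (before one-hot encoding), which is often the more interpretable view.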

Building a Random Forest Classifier¶

Random Forest is a bagging algorithm where the base models are decision trees. Bootstrap samples are drawn from the training data, and a decision tree is trained on each sample.

The results from all the decision trees are then combined, and the final prediction is made by voting (classification) or averaging (regression).
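The bootstrap-and-vote idea can be sketched by hand in a few lines; this is illustrative only (a real `RandomForestClassifier`, used below, additionally subsamples features at each split):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=600, random_state=42)

# Fit one tree per bootstrap sample (drawn with replacement)
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=42).fit(X[idx], y[idx]))

votes = np.stack([t.predict(X) for t in trees])      # shape: (n_trees, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)   # majority vote per sample
print(majority.shape)  # (600,)
```

Averaging many high-variance trees grown on different resamples is what reduces the overfitting seen in a single deep tree.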

In [ ]:
# Define the Random Forest model
random_forest_model = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42)
# Create a pipeline
model_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                 ('model', random_forest_model)])
In [ ]:
model_pipeline.fit(X_train, y_train)
Out[ ]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['LOAN', 'MORTDUE', 'VALUE',
                                                   'YOJ', 'DEROG', 'DELINQ',
                                                   'CLAGE', 'NINQ', 'CLNO',
                                                   'DEBTINC']),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['JOB', 'REASON'])])),
                ('model', RandomForestClassifier(random_state=42))])
In [ ]:
# Make predictions
y_pred = model_pipeline.predict(X_test)

# Evaluate the model
print("Accuracy on test set:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
Accuracy on test set: 0.910234899328859
Confusion Matrix:
 [[909  18]
 [ 89 176]]
Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.98      0.94       927
           1       0.91      0.66      0.77       265

    accuracy                           0.91      1192
   macro avg       0.91      0.82      0.86      1192
weighted avg       0.91      0.91      0.90      1192

The performance of the Random Forest Classifier on the same dataset demonstrates a significant enhancement in predicting loan repayment outcomes, achieving an accuracy of 91.02% on the test set. This improvement over the decision tree model's earlier performance is notable in several key areas:

  1. Increased Overall Accuracy: The Random Forest model exhibits a higher overall accuracy (91.02%) compared to the decision tree's accuracy (approximately 88.09%). This indicates a superior ability to correctly classify both paid and defaulted loans.

  2. Shifted Precision-Recall Trade-off for Defaulted Loans: The precision for predicting defaulted loans (class 1) rises sharply to 91%, up from the tuned decision tree's 74%. Recall, however, dips to 66% from approximately 72%, so the Random Forest flags fewer of the actual defaults but is far more often correct when it does. This shift likely reflects the balancing effect of the ensemble method, which reduces overfitting and variance.

  3. High Precision and Recall for Paid Loans: The Random Forest model maintains high precision (91%) and an impressive recall (98%) for predicting paid loans (class 0), suggesting a strong capability in identifying loans that will not default. Recall for class 0 is notably higher than the tuned decision tree's 93%, with precision essentially unchanged.

  4. Balanced Performance Across Classes: The macro averages for precision, recall, and F1-score show that the Random Forest model offers a more balanced performance across both classes compared to the decision tree. This balance is crucial for practical applications where both identifying defaults accurately and minimizing false positives are important.

  5. Enhanced F1-Scores Indicate Model Robustness: The F1-scores, which balance precision and recall, are higher for both classes in the Random Forest model, especially notable in the increased F1-score for defaulted loans (class 1) to 77%. This improvement suggests that the Random Forest model is not only accurate but also robust, providing reliable predictions across diverse scenarios.

Comparison with the Decision Tree Model:¶

  • The Random Forest model, an ensemble of decision trees, inherently mitigates some of the decision tree model's limitations, such as susceptibility to overfitting. By aggregating predictions from multiple trees, the Random Forest achieves higher accuracy and a more balanced performance across classes.
  • The higher precision and F1-score for defaulted loans (class 1) in the Random Forest model, compared to the decision tree, underscore the effectiveness of ensemble methods in handling class imbalance, a common challenge in loan default prediction tasks.
  • The higher overall accuracy and balanced class performance of the Random Forest model highlight the benefits of ensemble learning in complex prediction tasks, where single models like a decision tree may struggle to capture the nuances of the data.

In conclusion, the Random Forest Classifier's superior performance reflects its robustness and effectiveness in predicting loan repayment outcomes, making it a preferable choice over a single decision tree for tasks involving complex datasets with imbalanced classes.

Random Forest Classifier Hyperparameter Tuning¶

In [ ]:
param_grid = {
    'model__n_estimators': [100, 200],
    'model__max_depth': [None, 10, 20],
    #'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4],
    'model__class_weight': [None, 'balanced']
}

model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(random_state=42, class_weight={0: 1, 1: 3}))  # weight class 1 to improve recall; note the grid's class_weight options override this value
])

grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=2)


grid_search.fit(X_train, y_train)

# Evaluate the best model found by GridSearch
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print("Best model's accuracy:", accuracy_score(y_test, y_pred))
Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best model's accuracy: 0.9085570469798657
In [ ]:
# Evaluate the model
print("Accuracy on test set:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
Accuracy on test set: 0.9085570469798657
Confusion Matrix:
 [[901  26]
 [ 83 182]]
Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.97      0.94       927
           1       0.88      0.69      0.77       265

    accuracy                           0.91      1192
   macro avg       0.90      0.83      0.86      1192
weighted avg       0.91      0.91      0.90      1192

In [ ]:
random_forest_model = best_model.named_steps['model']
feature_importances = random_forest_model.feature_importances_
transformed_feature_names = best_model.named_steps['preprocessor'].get_feature_names_out()
feature_importances_df = pd.DataFrame({
    'Feature': transformed_feature_names,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)
plt.figure(figsize=(10, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importances_df.head(20))
plt.title('Top 20 Feature Importances in Random Forest Model')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()

The feature importances extracted from the hyperparameter-tuned Random Forest model provide valuable insights into the factors influencing loan repayment predictions. Here are some conclusions and observations:

  1. Dominant Role of Debt-to-Income Ratio (DEBTINC): The most critical feature influencing loan repayment predictions is the borrower's debt-to-income ratio, accounting for approximately 22.15% of the model's decision-making. This highlights the paramount importance of assessing borrowers' financial health and their ability to manage debt relative to their income.

  2. Significance of Credit History: Features related to borrowers' credit history, including the age of the oldest credit line (CLAGE), delinquency records (DELINQ), and derogatory reports (DEROG), are among the top influencers. These attributes collectively underscore the relevance of a borrower's past credit behavior in predicting loan defaults, with CLAGE and DELINQ, in particular, being nearly equally important after DEBTINC.

  3. Loan Amount and Property Value: The loan amount (LOAN) and the property value (VALUE) significantly influence predictions, indicating that the size of the loan and the value of the collateral are key factors in assessing loan risk.

  4. Credit Lines and Mortgage Due: The number of credit lines (CLNO) and the mortgage due (MORTDUE) also play substantial roles, suggesting that the breadth of a borrower's credit relationships and the outstanding mortgage amount are pertinent to their likelihood of loan repayment.

  5. Years at Job and Number of Inquiries: The years at the current job (YOJ) and the number of recent credit inquiries (NINQ) demonstrate a meaningful impact, albeit to a lesser extent compared to the top factors. These features reflect the stability of the borrower's employment and their recent search for credit, which can influence their repayment capability.

  6. Influence of Job Type and Loan Reason: Categorical variables related to the borrower's job type and the reason for the loan (DebtCon for debt consolidation, HomeImp for home improvement) show lower importance scores. However, their presence in the list indicates that these aspects, while not as critical as financial metrics, still provide valuable context for predicting loan outcomes.

  7. Minor Variations Among Job Types and Loan Reasons: The relatively close importance scores among different job types and loan reasons suggest a nuanced effect of these factors on loan repayment predictions. The model does not heavily favor one specific job type or loan reason over another, indicating a more balanced consideration of these attributes.

1. Comparison of the various techniques and their relative performance based on the chosen metric (measure of success):

  • How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?
Model                      Accuracy   Prec. (Class 0)   Prec. (Class 1)   Recall (Class 0)   Recall (Class 1)   F1-score
Logistic Regression        81.96%     84%               66%               94%                39%                0.80
Decision Tree              86.91%     90%               74%               94%                64%                0.87
Hypertuned Decision Tree   88.09%     92%               74%               93%                72%                0.88
Random Forest              91.02%     91%               91%               98%                66%                0.91
Hypertuned Random Forest   90.86%     92%               88%               97%                69%                0.91

Overall, the Random Forest model, both basic and hypertuned, performs relatively better compared to other techniques, with higher accuracy and better balance between precision and recall for both classes.
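The comparison above can be reproduced in outline with cross-validation. The sketch below is illustrative only: it uses a synthetic, imbalanced dataset from `make_classification` as a stand-in for the preprocessed HMEQ features and target used in the notebook, and compares the three base techniques on weighted F1.

```python
# Sketch: cross-validated comparison of the candidate classifiers,
# mirroring the table above. X and y are synthetic stand-ins here;
# in the notebook they would be the preprocessed features and target.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Imbalanced two-class problem, roughly matching the ~22% default rate
X, y = make_classification(n_samples=1000, n_features=12,
                           weights=[0.8, 0.2], random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_weighted")
    print(f"{name}: mean weighted F1 = {scores.mean():.3f}")
```

On the actual data, the same loop (with the hypertuned variants added) yields the ranking shown in the table, with Random Forest on top.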

2. Refined insights:

  • What are the most meaningful insights relevant to the problem?
  1. Critical Features Across Models:

    • Debt-to-Income Ratio (DEBTINC) consistently appears as a top predictor across all models, underscoring its pivotal role in assessing borrowers' financial health and their capacity to manage additional debt.
    • Credit History Attributes like the age of the oldest credit line (CLAGE), delinquencies (DELINQ), and derogatory reports (DEROG) are crucial across models, highlighting the importance of a borrower's past credit behavior in predicting loan defaults.
    • Loan Characteristics such as loan amount (LOAN) and property value (VALUE) have also been identified as significant predictors, emphasizing the role of the loan's size and the collateral's value in the risk assessment.
  2. Model Performance and Complexity:

    • The Random Forest model exhibited the highest accuracy and a balanced performance across classes, demonstrating the power of ensemble methods in capturing complex relationships and reducing overfitting compared to a single decision tree.
    • Logistic Regression provided a baseline with insights into linear relationships between features and outcomes, offering interpretability but potentially missing out on capturing more complex patterns that ensemble methods or decision trees can identify.
    • Decision Trees offered a middle ground with the advantage of interpretability and the ability to capture non-linear relationships, although susceptible to overfitting when not properly tuned.
  3. Implications for Lending Practices:
    • These models highlight the necessity of a holistic view in assessing loan applications, where a combination of financial health indicators, credit history, and loan-specific factors are considered together.
    • The significant features identified by the models can guide lenders in developing more nuanced risk assessment criteria, potentially leading to more accurate predictions of loan repayment and default.
    • The insights also suggest areas for further data collection and analysis, such as more detailed aspects of borrowers' financial situations or alternative data sources that could provide additional predictive power.

3. Proposal for the final solution design:

Based on the analysis and results obtained from the logistic regression, decision tree, and random forest models, we propose adopting the Random Forest model as the best solution for predicting loan repayment outcomes. This decision is rooted in several key factors that make it particularly suited to this problem:

  1. Balanced Performance:

    • The Random Forest model demonstrated the highest overall accuracy and presented a balanced performance across precision and recall metrics, particularly for the minority class (defaulted loans). This balance is crucial in the lending context, where minimizing both false negatives and false positives is important.
  2. Handling of Complex Relationships:

    • Random Forests effectively capture complex, non-linear relationships between features without requiring extensive feature engineering or transformation. This capability allows the model to leverage the underlying patterns in the data more effectively than logistic regression, which assumes linear relationships, and single decision trees, which may overfit to the training data.
  3. Robustness to Overfitting:

    • By aggregating predictions across many decision trees, Random Forests reduce the risk of overfitting, making the model more generalizable to unseen data compared to a single decision tree.
  4. Importance of Features:

    • The Random Forest model provides valuable insights into feature importance, helping identify key predictors of loan default. This information can be used not only to improve model predictions but also to inform risk management strategies and financial product development.
  5. Flexibility and Scalability:

    • Random Forests offer flexibility in tuning and can scale well with the addition of more data, making them a robust choice for ongoing application. The ability to adjust class weights and other hyperparameters allows the model to be fine-tuned for specific operational requirements or changing economic conditions.

Conclusion:

The Random Forest model strikes an effective balance between accuracy, interpretability, and operational feasibility, making it the best solution among the models considered. Its superior performance metrics, coupled with its robustness and flexibility, position it as a valuable tool for predicting loan repayment outcomes. Adopting this model can enhance the ability of financial institutions to assess loan risk more accurately, leading to better lending decisions, reduced default rates, and potentially more competitive financial products. Continuous monitoring and periodic re-tuning of the model will ensure its relevance and effectiveness in the dynamic landscape of credit risk assessment.

Executive Summary¶

Objective and Key Findings

Our comprehensive analysis aimed at enhancing loan default prediction models has led to significant insights and the identification of a robust predictive tool. Key findings include:

  1. Critical Predictive Factors: The Debt-to-Income Ratio (DEBTINC) was identified as the most influential predictor of loan defaults. Other critical factors include credit history attributes like the age of the oldest credit line (CLAGE), delinquencies (DELINQ), and derogatory reports (DEROG). Loan amount (LOAN) and property value (VALUE) were also significant, affecting the risk assessment outcomes.

  2. Optimal Model Selection: Among various models tested, the Random Forest model outperformed others, exhibiting superior accuracy and balance between recall and precision. This model effectively captures complex data relationships and provides robust predictability without overfitting.

  3. Insightful Data Visualizations: Visual analyses reinforced the quantitative findings, showing clear distinctions in financial behaviors between defaulting and non-defaulting applicants. These insights are crucial for understanding the underlying patterns that influence loan outcomes.

Final Model Specifications

The chosen Random Forest Classifier demonstrates excellent capability in handling the complexities of loan default predictions. It is characterized by:

  • 200 Trees to ensure a comprehensive ensemble approach.
  • Unrestricted Tree Depth for full growth and detailed data capture.
  • Balanced Class Weights to enhance focus on minority classes and improve model fairness.
  • Advanced Preprocessing Techniques including one-hot encoding for categorical variables and meticulous handling of missing data.
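The specifications above can be expressed directly as a scikit-learn pipeline. This is a minimal sketch, assuming the HMEQ column names; the exact imputation strategies used in the notebook may differ.

```python
# Sketch of a pipeline matching the final model specifications:
# 200 trees, unrestricted depth, balanced class weights, one-hot
# encoding for categoricals, and imputation for missing values.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

numeric_cols = ["LOAN", "MORTDUE", "VALUE", "YOJ", "DEROG",
                "DELINQ", "CLAGE", "NINQ", "CLNO", "DEBTINC"]
categorical_cols = ["REASON", "JOB"]

preprocessor = ColumnTransformer([
    # Median imputation for numeric features (assumed strategy)
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    # Mode imputation followed by one-hot encoding for categoricals
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

final_model = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier(
        n_estimators=200,        # 200 trees
        max_depth=None,          # unrestricted tree depth
        class_weight="balanced", # balanced class weights
        random_state=42,
    )),
])
```

Fitting `final_model` on the raw training frame then applies imputation, encoding, and classification in one step, which keeps preprocessing consistent between training and scoring.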

Strategic Recommendations and Next Steps

  1. Model Enhancement: Continue refining the Random Forest model through hyperparameter tuning.

  2. Feature Expansion: Investigate additional features and interactions that may uncover deeper insights into default risks. Update the model periodically to adapt to new economic conditions or data.

  3. Deployment and Monitoring: Implement the model within the existing loan processing infrastructure for real-time assessments and establish a system for ongoing performance monitoring and model updates.

  4. Regulatory Compliance and Ethics: Regularly review the model for compliance with financial regulations and ethical standards, ensuring it remains free from biases and maintains fairness across different borrower demographics.

  5. Continued Research and Development: Foster ongoing research into new data sources and predictive technologies to stay ahead of market trends and economic shifts.

This strategic approach ensures that our predictive modeling not only enhances financial decision-making but also aligns with industry best practices and regulatory standards, thereby supporting sustainable and profitable lending operations.

Problem and Solution Summary¶

Summary of the Problem

The primary challenge faced by the financial sector, specifically in the domain of retail banking, revolves around accurately predicting loan defaults. This issue is critical as defaults significantly impact a bank’s profitability and operational efficiency. Traditionally, the loan approval process has relied heavily on manual assessment, which not only consumes substantial time and resources but is also prone to human error and bias. In an era where financial markets are rapidly evolving and consumer credit profiles are becoming increasingly complex, traditional methods have shown limitations in effectively predicting loan repayment behaviors.

Key Points of the Final Proposed Solution Design

The proposed solution involves the deployment of a sophisticated Random Forest Classifier model, designed with the following key attributes:

  • Enhanced Predictive Accuracy: Utilizing a Random Forest approach enables the model to handle the intricacies of large and diverse datasets, improving accuracy in predicting defaults.
  • Balanced Class Weighting: Adjusting the weights for the minority class helps address the common issue of class imbalance in loan default data, improving the sensitivity of the model towards actual defaults.
  • Comprehensive Feature Utilization: By leveraging crucial predictors such as Debt-to-Income Ratio (DEBTINC), credit history (CLAGE, DELINQ, DEROG), and loan and property values, the model provides a holistic view of an applicant's financial health.
  • Advanced Data Handling Techniques: Incorporating robust preprocessing methods to manage outliers, missing data, and categorical variable encoding ensures that the model operates on clean and well-structured data.

Reason for the Proposed Solution Design

The design of this model is motivated by the need to enhance the efficiency and accuracy of the loan approval process. Random Forest was selected due to its ability to perform well with complex datasets that feature nonlinear relationships and interactions among variables. Its ensemble nature also mitigates the risk of overfitting, making it more reliable for operational use. The model’s ability to provide feature importance rankings further aids in refining risk assessment processes and developing more targeted financial products.

Impact on the Problem/Business

Implementing this solution would have a transformative effect on the business by:

  • Reducing Default Rates: By accurately predicting potential defaults, the bank can take preemptive measures to mitigate risk, potentially saving significant amounts in lost revenue.
  • Increasing Operational Efficiency: Automating the risk assessment part of the loan approval process reduces the workload on human analysts and speeds up decision-making, leading to faster loan processing and enhanced customer satisfaction.
  • Enhancing Decision-Making: The insights gained from the model regarding key predictive factors can inform strategic adjustments in loan offering practices and risk management policies.
  • Building Customer Trust: A more accurate and transparent loan approval process helps in building trust and maintaining a strong customer relationship, which is crucial for customer retention and satisfaction.

In conclusion, the adoption of the Random Forest Classifier model represents a strategic move towards leveraging advanced analytics to improve financial decisions and risk management in banking operations, aligning with broader trends towards data-driven decision-making in the industry.

Recommendations for Implementation¶

Key Recommendations to Implement the Solution

  1. Integration with Existing Systems:

    • Seamlessly integrate the Random Forest model into the existing loan processing system. This may involve developing an API that allows the model to receive input data from the loan application interface and return predictions.
  2. Training and Validation:

    • Conduct extensive training sessions for credit analysts and decision-makers to understand and effectively utilize the model outputs. Ensure robust validation and testing phases are conducted with historical data to confirm model accuracy before full deployment.
  3. Continuous Monitoring and Updating:

    • Establish protocols for continuous monitoring of the model's performance. This includes setting up regular updates and retraining schedules to adapt to new data and changing market conditions.
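The deployment and monitoring steps above can be sketched in a few lines. This is illustrative only: the file path, the accuracy floor, and the trained model here are stand-ins, and a production setup would score held-out recent loans rather than training data.

```python
# Sketch: persisting the trained pipeline for deployment and a simple
# performance check that flags the model for retraining. The model,
# path, and threshold below are illustrative stand-ins.
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Stand-in for the fitted pipeline from the notebook
X, y = make_classification(n_samples=500, weights=[0.8, 0.2],
                           random_state=0)
final_model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
).fit(X, y)

# 1. Persist the validated model for the loan processing system.
joblib.dump(final_model, "loan_default_model.joblib")

# 2. In production, reload it and score incoming applications.
deployed = joblib.load("loan_default_model.joblib")
preds = deployed.predict(X)

# 3. Monitoring: flag for retraining if accuracy drops below a floor.
ACCURACY_FLOOR = 0.85  # illustrative threshold
current_accuracy = accuracy_score(y, preds)
needs_retraining = current_accuracy < ACCURACY_FLOOR
print(f"accuracy={current_accuracy:.3f}, retrain={needs_retraining}")
```

In practice the monitoring step would run on a rolling window of recently matured loans, since true default outcomes arrive with a lag.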

Key Actionables for Stakeholders

  1. IT and Data Science Teams:

    • Develop and maintain the integration infrastructure. Ensure data flows correctly between the model and the loan processing system and manage the initial deployment and ongoing operation of the model.
  2. Risk Management Teams:

    • Utilize model outputs to refine risk assessment protocols and loan approval criteria. Collaborate with data science teams to understand model insights and implications for risk management strategies.
  3. Executive Leadership:

    • Approve budgets and resources for the project. Facilitate cross-departmental collaboration and oversee the strategic alignment of the model deployment with business objectives.

Key Risks and Challenges

  • Model Overfitting or Bias: Risk of the model being overfitted to historical data or inheriting biases from training datasets.
  • Integration Challenges: Potential technical difficulties in integrating the model with existing banking systems, which could disrupt loan processing.
  • Adaptability to Economic Changes: The model might not adapt quickly to sudden economic changes or rare events not represented in the training data.
  • Data Limitations: The current feature set may not capture all drivers of default; exploring additional data sources, such as real-time economic indicators or alternative credit data, could mitigate this gap.

By addressing these recommendations, actionables, and potential risks, the implementation of the Random Forest model can be effectively managed to maximize its benefits and ensure it contributes positively to the organization's strategic objectives.